On Sep 30, 2004, at 9:25 AM, Bill Moseley wrote:
> On Thu, Sep 30, 2004 at 08:51:35AM -0700, Aaron Levitt wrote:
>> I began the indexing approximately 72 hours ago, and it hasn't ended
>> yet. It is running on a G3 450MHz machine with 576MB of RAM. I can
>> see swish-e hitting my webserver, and the .temp database seems to
>> continue to grow. I ran the indexer with the following command:
>> ./bin/swish-e -S prog -c swish.conf.
>
> Are you using the -e option? If not you have likely run out of RAM,
> or at least the hash tables are getting so big that indexing has
> slowed way down. Did you look at free(1) and vmstat(8) and other
> tools to see how your machine is holding up?
I didn't use the -e option. My available RAM hovered around 20MB for
the last 24 hours or so. It never actually died, but I think you
are right about it slowing down. ;)
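For the next run I plan to add -e and keep an eye on memory while it
works. A minimal sketch, assuming the same paths and config file as my
original command; the function names are only labels I made up:

```shell
#!/bin/sh
# Same invocation as before, plus -e (economy mode), which trades
# speed for lower RAM use by spilling data to temporary files on disk.
index_economy() {
    ./bin/swish-e -e -S prog -c swish.conf
}

# Report memory and swap activity every 5 seconds while indexing runs.
# Heavy swapping here would point to the indexer running out of RAM.
watch_memory() {
    vmstat 5
}

# Usage (not run here): start index_economy in one terminal and
# watch_memory in another.
```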
> Did you test things out with smaller sets of files first?
Initially, I tried to get it to index a smaller set of files, but I
can't alter my current robots.txt, and FileRules/FileMatch doesn't seem
to work with the spider.pl script (unless I missed something). If you
have a suggestion on how to get spider.pl to only index certain dirs, I
would appreciate it. That way I could also run separate indexes.
>> So, I have the following questions:
>>
>> 1. I expect to have over 1,000,000 documents in our archives as things
>> progress. Is this pushing the limits of swish-e?
>
> I'm tempted to say yes, but I know others on the list have/are
> indexing that many docs.
Judging from the responses I've received from the list, I think this
isn't an issue. =)
> The basic problem is swish is designed to
> use RAM to be fast -- but also Jose has added features like -e and
> also a new btree database back-end (not enabled by default).
How do I use the new btree database back-end? Would this be better
than the default?
>> 3. What should I do regarding the current index process? I'm afraid to
>> stop it, because I don't want to have to start the indexing all over
>> again.
>
> Well, you can strace the process to see what it's doing. But even the
> spider doesn't know when it will be done until it's actually done.
I finally gave up and sent a SIGHUP to spider.pl. I was off by about a
day on how long it had been running. Also, I think if I had been a bit more
patient, it would have actually finished. I believe there are roughly
700,000 documents at the moment, and it got through over 630,000. The
results are below.
>> 4. Do you have any recommendations on what I can do to improve this
>> process?
>
> A few ideas
>
> In general, you can spider separately from indexing. Just capture the
> output from the spider to a file and when done pipe that into swish.
This might be an alternative. Can you give me a little more detail on
the best way to do this?
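If I understand the suggestion correctly, the two steps would be
roughly the following. This is only a sketch: the URL, the capture file
name, and the location of spider.pl are placeholders, and I am assuming
that -i stdin on the command line takes precedence over whatever
IndexDir says in swish.conf.

```shell
#!/bin/sh
# Placeholder name for the file that holds the captured spider output.
SPIDER_OUT="spider.out"

# Step 1: run the spider on its own and save everything it fetches.
# "default" asks spider.pl to use its built-in default configuration;
# the URL to crawl is passed as the first argument.
run_spider() {
    ./spider.pl default "$1" > "$SPIDER_OUT"
}

# Step 2: index the saved output. With -S prog, "-i stdin" makes
# swish-e read the documents from standard input instead of
# launching the spider itself.
run_indexer() {
    ./bin/swish-e -S prog -i stdin -c swish.conf < "$SPIDER_OUT"
}

# Usage (not run here):
#   run_spider http://archive.example.com/ && run_indexer
```

The point of splitting it up would be that a failed or interrupted
indexing run can be repeated from the saved file without hitting the
web server again.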
> You might be able to index the raw email messages faster than
> spidering the mail archive.
Unfortunately, I can't do this, due to production environment
limitations. Also, for the sake of protecting email addresses, it has
to be done via http rather than directly from the file system.
> Since you are indexing a mail archive (where old messages don't
> change) then you should try building swish with the
> --enable-incremental option. And then you can *add* files to the
> index as needed. It still requires some of the normal processing (like
> presorting all the records) but should be faster than reindexing.
So, at this point I am going to move swish to a faster box with about
double the RAM, and I will rebuild swish with that option. Should I
continue to use the database the way it is, or should I try the btree
database back-end? I have also been considering the mysql route.
> That help at all?
Yes, it helped immensely, along with the other responses I got from
other list members. I am guessing that the real issue is with the
indexing via the spider, and not the database. I will try it again
with the incremental option built in, and see if I can get it indexed
within a day or two.
Last but not least... the results of the indexer's first run:
475,944 unique words indexed.
5 properties sorted.
637,449 files indexed. 2,932,324,538 total bytes. 231,714,672 total words.
Elapsed time: 47:57:02 CPU time: 04:30:30
Indexing done!
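A back-of-the-envelope check on those numbers, as a sketch. It assumes
my rough figure of 700,000 documents is right and that the average rate
would have held to the end, which is optimistic given that indexing
slows down as memory fills up:

```shell
#!/bin/sh
# Figures from the summary above; files_total is only my estimate.
files_done=637449
files_total=700000
elapsed_s=$(( 47*3600 + 57*60 + 2 ))    # elapsed time 47:57:02 in seconds

# Average rate in files per hour (integer arithmetic, rounded down).
files_per_hour=$(( files_done * 3600 / elapsed_s ))

# Time still needed for the remaining files at that rate, in minutes.
remaining=$(( files_total - files_done ))
minutes_left=$(( remaining * 60 / files_per_hour ))

echo "rate: $files_per_hour files/hour, about $minutes_left minutes left"
```

That comes to roughly 13,000 files an hour, so under five more hours at
the average rate.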
Thanks again for all the input!
-=Aaron
Received on Thu Sep 30 11:17:44 2004