
Re: swish-e on a large scale

From: Aaron Levitt <alevitt(at)not-real.apple.com>
Date: Thu Sep 30 2004 - 18:17:21 GMT
On Sep 30, 2004, at 9:25 AM, Bill Moseley wrote:

> On Thu, Sep 30, 2004 at 08:51:35AM -0700, Aaron Levitt wrote:
>> I began the indexing approximately 72 hours ago, and it hasn't ended
>> yet.  It is running on a G3 450Mhz machine with  576Mb of RAM.  I can
>> see swish-e hitting my webserver, and the .temp database seems to
>> continue to grow.  I ran the indexer with the following command:
>> ./bin/swish-e -S prog -c swish.conf.
>
> Are you using the -e option?  If not you have likely run out of RAM,
> or at least the hash tables are getting so big that indexing has
> slowed way down.  Did you look at free(1) and vmstat(8) and other
> tools to see how your machine is holding up?

I didn't use the -e option.  My available RAM hovered around 20MB for 
the last 24 hours or so.  The machine never actually died, but I think 
you're right about the indexing slowing way down. ;)
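
Next run I'll add the -e switch so swish-e keeps its working data on 
disk instead of in RAM -- if I'm reading the docs right, it should 
just be the same command with the extra flag:

    ./bin/swish-e -e -S prog -c swish.conf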

> Did you test things out with smaller sets of files first?

Initially, I tried to get it to index a smaller set of files, but I 
can't alter my current robots.txt, and FileRules/FileMatch don't seem 
to apply when indexing through the spider.pl script (unless I missed 
something).  If you have a suggestion on how to get spider.pl to index 
only certain directories, I would appreciate it.  That way I could 
also build separate indexes.
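
For what it's worth, my best guess from the spider.pl documentation is 
that a test_url callback in the spider config would let me restrict it 
to certain directories -- something like this (untested, and the URL 
and key names are just my reading of the docs):

    @servers = (
        {
            base_url => 'http://lists.example.com/archives/2004/',
            email    => 'alevitt(at)not-real.apple.com',
            # only follow URLs whose path is under /archives/2004/
            test_url => sub { $_[0]->path =~ m!^/archives/2004/! },
        },
    );

Is that roughly how test_url is supposed to be used?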

>> So, I have the following questions:
>>
>> 1. I expect to have over 1,000,000 documents in our archives as things
>> progress.  Is this pushing the limits of swish-e?
>
> I'm tempted to say yes, but I know others on the list have/are
> indexing that many docs.

Judging from the responses I've received from the list, I think this 
isn't an issue. =)

> The basic problem is swish is designed to
> use RAM to be fast -- but also Jose has added features like -e and
> also a new btree database back-end (not enabled by default).

How do I use the new btree database back-end?  Would this be better 
than the default?

>> 3. What should I do regarding the current index process?  I'm afraid 
>> to
>> stop it, because I don't want to have to start the indexing all over
>> again.
>
> Well, you can strace the process to see what it's doing.  But even the
> spider doesn't know when it will be done until it's actually done.

I finally gave up and sent a SIGHUP to spider.pl.  My estimate of how 
long it would take was off by about a day; if I had been a bit more 
patient, it probably would have finished.  I believe there are roughly 
700,000 documents at the moment, and it got through over 630,000.  The 
results are below.

>> 4. Do you have any recommendations on what I can do to improve this
>> process?
>
> A few ideas
>
> In general, you can spider separately from indexing.  Just capture the
> output from the spider to a file and when done pipe that into swish.

This might be a good alternative; can you give me a little more detail 
on the best way to do this?  My rough guess is below.
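
Just to check my understanding, is it roughly this?  (File names are 
my guesses, and I'm assuming "stdin" is the special -S prog source 
that makes swish-e read from standard input.)

    # run the spider on its own and save what it would have fed swish-e
    ./prog-bin/spider.pl spider.config > spider_output.txt

    # later, index the saved output
    ./bin/swish-e -S prog -i stdin -c swish.conf < spider_output.txt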

> You might be able to index the raw email messages faster than
> spidering the mail archive.

Unfortunately, I can't do this due to production environment 
limitations.  Also, for the sake of protecting email addresses, it has 
to be done via HTTP rather than directly from the file system.

> Since you are indexing a mail archive (where old messages don't
> change) then you should try building swish with the
> --enable-incremental option.  And then you can *add* files to the
> index as needed.  It still requires some of the normal processing (like
> presorting all the records) but should be faster that reindexing.

So, at this point I am going to move swish to a faster box with about 
double the RAM, and I will rebuild swish with that option.  Should I 
keep using the database format as it is, or should I try the btree 
back-end?  I have also been considering the MySQL route.
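
Just so I have the build steps straight before I start on the new box, 
I'm assuming it's the usual configure/make with that one extra switch, 
and then the same indexing run as before (plus -e this time):

    ./configure --enable-incremental
    make
    make install

    ./bin/swish-e -e -S prog -c swish.conf

Please correct me if the incremental back-end needs more setup than that.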

> That help at all?

Yes, it helped immensely, along with the other responses I got from 
other list members.  I'm guessing the real issue is with indexing via 
the spider rather than with the database.  I will try it again with 
the incremental option built in and see if I can get it indexed 
within a day or two.

Last but not least... the results of the indexer's first run:

475,944 unique words indexed.
5 properties sorted.
637,449 files indexed.  2,932,324,538 total bytes.  231,714,672 total words.
Elapsed time: 47:57:02 CPU time: 04:30:30
Indexing done!

Thanks again for all the input!

-=Aaron
Received on Thu Sep 30 11:17:44 2004