Re: Use of swish-e in BaBar high energy physics exp.

From: Douglas Smith <douglas(at)not-real.SLAC.Stanford.EDU>
Date: Tue Apr 15 2003 - 21:47:33 GMT
I am going to split this up into different threads:

On Tuesday 15 April 2003 02:10 pm, Bill Moseley wrote:
> Thanks good.
>
> > It also proved to be so much
> > faster than other engines we have been able to up the update
> > time to every 15mins (and it could probably handle every
> > 5mins),
>
> Could you post the output from indexing some time and note what hardware
> you are running on, and perhaps memory usage?
>

I don't have a lot of hard numbers.  This all got started because
ht-dig was being used, and the posting forums just grew to the
point where ht-dig just didn't scale any more.  At the end, it
would take ht-dig over 20 hours to initially create the search
index of the 150,000 postings, and then when it tried to merge in
the incremental index it would end up with a corrupted database.
There was no easy way to search multiple indexes like in swish-e.
Tests on harvest proved it was not useful for us in the end.
Inktomi can index the discussion forums, but would produce
inconsistent replies to searches, which people found frustrating.
Google sent us a test box, and it worked just fine for us, but
then other people got into the discussion about how much it would
cost, who would pay and when, where the box would sit, and security
control... Those discussions have not finished yet.

But swish-e is now working.  The tests were done with two machines,
both dual Pentium III 866MHz with 1GB of memory.  One was the
web server for the forums, and the other was the swish-e server,
which built the index and ran a web server for the search CGIs.

The tests started with indexing using the spider on the forum
web server.  After tuning exactly which pages should be indexed,
we got to the core 150,000 postings, and it would take 9.5-10 hours
to index the pages.  During this time the forum server would
be at 40% cpu load, some of that from the web server and the rest
from file access.  The swish-e server would be at ~30% cpu load,
divided almost equally between the swish-e executable and
spider.pl, but these numbers would vary during indexing.  The
swish-e executable would end up using about 20% of memory by the
end.
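
To put the figures so far on a common scale, here is a rough
back-of-the-envelope sketch (not part of the original measurements)
that converts the wall-clock times quoted above into postings per
second.  The function name is made up for illustration, and since
ht-dig took "over 20 hours", its rate is an upper bound.

```python
# Rough throughput arithmetic from the figures quoted in this post.
# postings_per_second is a hypothetical helper, not part of swish-e.

POSTINGS = 150_000  # size of the core forum archive


def postings_per_second(postings, hours):
    """Average indexing rate for a run that took `hours` of wall-clock time."""
    return postings / (hours * 3600)


htdig_rate = postings_per_second(POSTINGS, 20)      # ht-dig, "over 20 hours"
spider_slow = postings_per_second(POSTINGS, 10)     # spider run, slow end
spider_fast = postings_per_second(POSTINGS, 9.5)    # spider run, fast end

print(f"ht-dig (at most):  {htdig_rate:.2f} postings/s")
print(f"swish-e + spider:  {spider_slow:.2f}-{spider_fast:.2f} postings/s")
```

So the spider-based run was roughly twice as fast as the ht-dig
initial build, even while going through the web server.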

Then I went on to create a specific program to feed forum postings
to the swish-e executable without the web server, so I could better
control what exactly got indexed and set up a bunch of meta tags
to give info on each posting.  This took the forum server out of
the picture, and all load was on the swish-e server only.  After
tuning which files to index, this brought the time down to 1.5 hours
for 150,000 postings.  During this time the cpu load was shared
between swish-e and file access.  I can't remember the numbers, but
it was about 60% loaded.  The swish-e executable in the end was
using 20% of memory as before.
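
For readers who want to try the same approach, below is a minimal
sketch of such a feeder, assuming swish-e's "prog" input method, in
which an external program writes each document to stdout as a few
headers (Path-Name and Content-Length, optionally Last-Mtime and
Document-Type), a blank line, and then the content.  This is not the
program used at SLAC: the function names, the sample posting, and the
"author" meta tag are all hypothetical.

```python
#!/usr/bin/env python3
# Sketch of a feeder for swish-e's "prog" input method.
# All names and the sample data below are illustrative assumptions.
import html
import sys


def build_html(subject, author, body):
    """Wrap one posting in minimal HTML with a (hypothetical) author meta tag."""
    return (
        "<html><head>"
        f"<title>{html.escape(subject)}</title>"
        f'<meta name="author" content="{html.escape(author, quote=True)}">'
        "</head><body><pre>"
        f"{html.escape(body)}"
        "</pre></body></html>\n"
    )


def format_record(path_name, document, last_mtime=None, doc_type="HTML"):
    """Return one document as bytes: headers, blank line, then content."""
    payload = document.encode("utf-8")
    headers = [
        f"Path-Name: {path_name}",
        # Content-Length must count bytes, not characters.
        f"Content-Length: {len(payload)}",
    ]
    if last_mtime is not None:
        headers.append(f"Last-Mtime: {int(last_mtime)}")
    headers.append(f"Document-Type: {doc_type}")
    return ("\n".join(headers) + "\n\n").encode("utf-8") + payload


def write_records(postings, stream):
    """Write every posting to a binary stream, one record after another."""
    for path_name, subject, author, body, mtime in postings:
        stream.write(format_record(path_name, build_html(subject, author, body), mtime))


if __name__ == "__main__":
    # Made-up sample; a real feeder would read the forum's own storage.
    sample = [("forum/1.html", "Test subject", "A. Poster", "Hello, world.", 1050443253)]
    write_records(sample, sys.stdout.buffer)
```

A script like this would typically be run by swish-e itself, along
the lines of "swish-e -S prog -i ./feed_postings.py -c swish.conf"
(the script and config file names are placeholders), with the meta
tag names listed under MetaNames in the config so they are indexed.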

So that is that story. (I know, too many details...)

Douglas

-- 
-----------------------------------------------------------
Douglas A. Smith                  douglas@slac.stanford.edu
Office: Bld 280, Rm 157                       (650)926-2369
-----------------------------------------------------------
Received on Tue Apr 15 21:48:32 2003