Lucene/Nutch (WAS: converting .temp indices...)

From: Dave Stevens <dstevens(at)not-real.roaddog.com>
Date: Sun Dec 07 2003 - 12:10:47 GMT
>> This machine is a single Athlon XP 1800+ with an inexpensive Asus K7
>> board and only 512 MB of RAM.
>
> I assume you are using -e when indexing.

Actually no.  I was more or less probing the limitations of Swish-e,
knowing it wasn't really intended to handle somewhat massive data sets
and that my resources (particularly RAM) might be taxed.  I'll try the -e
switch on the next crawl and see if it helps.  I'm also adding another
512MB of RAM to see what that does.  When I SIGHUPed the last crawl at 96
hours, pages were being returned at a rate of two or fewer a minute, much
slower than in any of the other crawls.

Another issue I could see was that after the spider had been crawling for
several hours, unless Apache was restarted a couple of times a day, PHP
performance would tank to the point where PHP/mysql pages would take
upwards of a minute to serve.  The html and cgi/DBI/mysql pages loaded a
little slower than normal but were still acceptable.  I just don't have
enough box at the moment.


> On such a large scale you need something where you can incrementally
> update the index.  Frankly, if documents are available locally I think
> completely reindexing with swish-e is often as fast as updating other
> types of indexes.  Maybe.

Using Swish-e when I was operating my previous sites we did all the
indexing using FS.  Blazingly fast it was, and the last deployment was
about a year ago just before the current regime "excused" the entire data
center, development and management staff just after the purchase.  I'd
like to see how it does with 2.4.0.

Incremental indexing is an issue anyway, as most of this content is
community generated (message posts, classifieds, blog entries, etc.), and
even with the "Big Boy" crawlers it's not possible to aggregate
up-to-the-minute info from several sources.  I think this is where tools
like RSS feeds and P2P like Jabber may do a better job for real-time info,
but it still has to be archived and indexed somewhere.



> Another to look at, if you can stand java, is Lucene.  I haven't tried
> it but their goal is an Open Source large-scale search engine.

Just spent a couple of hours boning up on it, thanks for the pointer. 
Lots of good info there.  Unfortunately for me, my Java experience
consists of three Sun training days nearly four years ago.  Lucene is only
the API and to build a complete app it requires the developer to build or
implement the parsers, the front end and crawler.  Lucene really only
provides text indexing and searching, though from many reports it is said
to be pretty speedy.
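
To make that division of labor concrete, here is a toy sketch in Python of
the indexing-and-searching core that such a library covers.  It is not
Lucene's actual API or data structures; the TinyIndex class and its method
names are made up purely for illustration.  Fetching pages, parsing HTML
and presenting results would all sit outside it.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index (hypothetical): term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Tokenize on runs of letters/digits, lowercased, and record
        # which document each term appears in.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[term].add(doc_id)

    def search(self, query):
        # AND semantics: return ids of documents containing every term.
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = TinyIndex()
index.add("post-1", "Used Athlon board for sale")
index.add("post-2", "Athlon XP crawl notes")
hits = index.search("athlon crawl")
```

Because add() can be called at any time, new documents can be folded into
this toy index without rebuilding it.  A real engine also has to deal with
deletions, ranking and on-disk storage, none of which is attempted here.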

FWIW, Doug Cutting started Lucene, and with his involvement in Nutch I'm
not sure how much he contributes to Lucene any more, though there seem to
be many smart folks working on Lucene.  One interesting thing, to me
anyway, is that Nutch indices are in Lucene index format.  Nutch is also
written in Java, but it's distributed as a complete app (search for Nutch
at Sourceforge), though it looks like it may be several months, if not a
year or better, before it's to the point where it's able to be used by
more than just developers or enthusiasts.  They say that last June they
built a 100 million page archive prototype but don't have the hardware to
provide it publicly.

Dave
Received on Sun Dec 7 12:10:51 2003