About a year ago I started a project to build a vertical search engine for
our industry. I evaluated Lucene, swish-e and Nutch. Lucene is really a
search engine library and not a complete app. You'll have to roll your
own spiders and a way for the Webapp to display the results. By all
reports it is very powerful once you do that, but you have to build most
of the interfacing to your Webapp yourself.
Doug Cutting, who wrote Lucene, started Nutch as well, and Nutch was funded
in part by Overture until Overture got bought. It's been said that Yahoo
has used parts of Nutch, and there is a Nutch demo over at Yahoo Labs. We
spidered the same sites (approx. 2.5 million pages) in our proof of concept
with both swish-e (using spider.pl) and Nutch (using its native crawler
tool). Nutch isn't finished and is still in beta, so there were some
issues, though generally it performed very well.
There were a few drawbacks with swish-e as a large Web-wide search tool.
First, after long crawls (700-800k pages, or a few days) spider.pl would
hang or become incredibly slow, even on a dual Opteron 242 with 4GB RAM.
Several times we had to kill the spider.pl process and build the indices
from the partial crawl. Another issue was index file size, though at the
time the swish-e crew was working on large indices. The other bummer about
swish-e was the lack of incremental indexing at the time. IIRC this is
also something they are working on. To be fair, I don't think the original
intent of swish-e was to be a Web-wide search tool, but it does a pretty
good job up to a million or two pages. We were supposed to launch the
public portion of the search tool (using swish-e) last month, but the biz
dev wonks are holding it until after the first of the year to coincide
with a large trade show.
If you have control of the content to be searched and can use the file
system method, or a combination of file system and spider with multiple
indices, swish-e is an excellent tool, though you may have to do some
workarounds once you get past a couple million pages. It's so easy even a
project manager like me can get it up and going. ;-)
>> I am evaluating open source search engines for use in a project where
>> the data to be indexed could get pretty big (a few GB of documents,
>> each of about 10-20KB). I would like to hear the experiences of
>> anybody who has used Swish-E in such a scenario. Any hints/tips or
>> caveats to be aware of? What kind of search performance can I expect
>> (given that I run the search on a recent machine with lots of RAM)?
>> Also, has anyone compared Swish-E vs. Lucene in terms of scalability
>> and performance?
>> Thanks in advance,
Received on Mon Nov 15 13:06:13 2004