
Re: converting .temp indices to usable indices

From: Dave Stevens <dstevens(at)not-real.roaddog.com>
Date: Sun Dec 07 2003 - 05:23:22 GMT
> Pushing the design of swish-e, perhaps.  Seems like more people are
> using swish-e for large collections.  How much RAM are you using?

This machine is a single Athlon XP 1800+ with an inexpensive Asus K7
board and only 512 MB of RAM.  The indices were built using six different
crawls, increasing the number of sites spidered with each crawl.  The
first was a test run of a single site, about 250,000 text files from a
message forum.  It took about 14 hours, and not long into it the machine
was tapped out of RAM.  I previously had access to the file system, and using FS it
took much less time.  If I can make this work with at least 20 sites or so
(and I think I can), I'll build something like a dual Opteron with 4GB or
so of RAM and a server quality board and array.  I've also got a couple of
Sun 220R two-way 440s with 2 GB of RAM each collecting dust in storage, but
I'd rather do this with open and free software on more or less commodity
servers.

I've got a good handle on what is required for mid-level commercial sites
from my last six years or so doing that.  It's the search technology and
the methods the spiders, indexers and searchers use that are my steepest
learning curve now, and there's not much freely available info on building
larger-scale search engines.  My first inclination was that the index file
sizes are going to exceed the file size limits of Linux and ext3, so I'm
testing to see what the final index size will be.  It may be moot, as the
20 or so people testing it (about 1000 use it on a regular basis) are
indicating that it may be better to use smaller, segmented, categorized
indexes rather than one gigantic index.  At any rate, I've got five Sun
boxes and a StorEdge array available, so I don't think storage or file
sizes will be an issue in any case.
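
Roughly what I have in mind for the segmented approach, as a sketch only:
the category names and index paths below are made up, and it assumes
swish-e's -w (query) and -f (index file) switches behave the way the docs
describe.

#!/usr/bin/env python
"""Sketch: fan one query out over several smaller category indexes
instead of one gigantic index.  Paths and category names are
placeholders, not real files."""

import subprocess

# Hypothetical category indexes, one per group of sites.
INDEXES = {
    "forums": "/indexes/forums.index",
    "news":   "/indexes/news.index",
    "docs":   "/indexes/docs.index",
}

def search(query, categories=None):
    """Run the same query against each selected index and return the
    raw swish-e output lines per category."""
    results = {}
    for name, path in INDEXES.items():
        if categories and name not in categories:
            continue
        proc = subprocess.run(["swish-e", "-w", query, "-f", path],
                              capture_output=True, text=True)
        results[name] = proc.stdout.splitlines()
    return results

if __name__ == "__main__":
    for category, lines in search("opteron", ["forums", "docs"]).items():
        print(category, len(lines), "result lines")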

> Did you look at inktomi?  It uses a database that is searchable as it is
> indexed.

No, but I will.  I've looked at Nutch, but it doesn't seem like much is
going on, though they do post snapshots fairly regularly, and I've heard
they have anonymous CVS access, but I haven't used it.  Last year I looked
at Google appliances for my former sites, and it was a couple hundred grand
for a two-year license.  At this point there is no investment in the
current project other than me (I don't even have a business model yet), so
I need to make it work with a free software license of some sort.  I'm
willing to spend a few grand on hardware, move back into a colo, and
support that (about half a dozen boxes live in what was my dining room),
but I won't be able to afford licensing any enterprise-level software.


> You might also consider writing the output from the spider to a local
> database of some type that allows updating over time.  And then have
> swish-e index that local cache.

Thanks.  That's something I could do without spending much, if any, money.
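
Something along these lines is what I'd try, as a rough sketch: the spider
writes pages into a local cache directory (the layout here is made up), and
a small feeder script hands them to swish-e through the -S prog interface
(Path-Name / Content-Length headers followed by the body, per the swish-e
docs), invoked as something like: swish-e -S prog -i ./feed_cache.py -c swish.conf

#!/usr/bin/env python
"""Sketch: feed a local spider cache to swish-e via -S prog.
The cache location and layout are placeholders."""

import os
import sys

CACHE_DIR = "/var/spool/spider-cache"   # placeholder cache location

def emit(path):
    """Write one cached document to stdout in -S prog header format."""
    with open(path, "rb") as f:
        body = f.read()
    out = sys.stdout.buffer
    out.write(b"Path-Name: %s\n" % path.encode())
    out.write(b"Content-Length: %d\n" % len(body))
    out.write(b"Last-Mtime: %d\n" % int(os.path.getmtime(path)))
    out.write(b"\n")
    out.write(body)

if __name__ == "__main__":
    for dirpath, _, filenames in os.walk(CACHE_DIR):
        for name in filenames:
            emit(os.path.join(dirpath, name))
    sys.stdout.buffer.flush()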


> You mean the two URLs return the same content?  One solution is to use
> the md5 check to filter those out,

I use MD5 and it works great, even on dynamic content.  These pages carry
the same information, but the other elements (logos, ads, external links,
etc.) are for different brands, so while the pages contain the same
information, they render as different pages depending on the particular
brand.  The page sizes differ by a few hundred bytes or more.
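
If it ever becomes a problem, one thing I could do is hash only the visible
text of the page after stripping the markup, so two brandings of the same
article collapse to one fingerprint.  Just a sketch; the stripping rules
are guesses that would need tuning per site.

#!/usr/bin/env python
"""Sketch: content fingerprint that ignores per-brand page chrome."""

import hashlib
import re

def fingerprint(html):
    """MD5 of the visible text only, so markup and branding differences
    don't change the hash."""
    text = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)  # drop scripts/styles
    text = re.sub(r"(?s)<!--.*?-->", " ", text)                # drop HTML comments
    text = re.sub(r"(?s)<[^>]+>", " ", text)                   # drop remaining tags
    text = re.sub(r"\s+", " ", text).strip().lower()           # normalize whitespace
    return hashlib.md5(text.encode("utf-8")).hexdigest()

seen = set()

def is_duplicate(html):
    """True if a page with the same underlying text was already seen."""
    h = fingerprint(html)
    if h in seen:
        return True
    seen.add(h)
    return False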


> Yes.  You send the spider a SIGHUP and the spider will stop spidering
> and swish-e will write the index.

Excellent, thank you.
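
For my own feeder scripts I could follow the same pattern: catch SIGHUP,
finish the document in progress, and exit cleanly so the index gets
written.  A toy sketch, with the queue and the fetch step as stand-ins:

#!/usr/bin/env python
"""Sketch: stop a home-grown crawl loop cleanly on SIGHUP."""

import signal

stop_requested = False

def on_hup(signum, frame):
    """SIGHUP handler: ask the crawl loop to stop after the current page."""
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGHUP, on_hup)

def crawl(queue):
    while queue and not stop_requested:
        url = queue.pop(0)
        # fetch(url) and hand the document to the indexer would go here
        print("fetched", url)
    print("stopping cleanly so the index can be written")

if __name__ == "__main__":
    crawl(["http://example.com/a", "http://example.com/b"])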


I don't think ORA has a book with a black widow on the cover yet.
Go ahead and put me down for five books....;-)

Dave
http://charlotte.roaddog.com/
Received on Sun Dec 7 05:23:27 2003