On Sat, Dec 06, 2003 at 05:25:04PM -0800, Dave Stevens wrote:
> The spider is doing pretty well, nearly a million pages crawled in the
> last couple of weeks.
Pushing the design limits of swish-e, perhaps. It seems more people are
using swish-e for large collections these days. How much RAM are you using?
Did you look at Inktomi? It uses a database that is searchable as it is
being updated. You might also consider writing the output from the spider
to a local database of some kind that allows updating over time, and then
have swish-e index that local cache.
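One way to sketch that local-cache idea: have the spider store pages in a small database, then feed them to swish-e with `-S prog`, which reads documents from a program's stdout. The table and column names below are made up for illustration; only the header format (Path-Name, Content-Length, then a blank line and the body) is swish-e's.

```perl
#!/usr/bin/perl
# Hypothetical sketch: emit a local page cache in swish-e's -S prog format.
# Assumes a SQLite table "pages(url TEXT, mtime INTEGER, content BLOB)"
# that a spidering pass has already populated.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=cache.db', '', '',
    { RaiseError => 1 } );

my $sth = $dbh->prepare('SELECT url, mtime, content FROM pages');
$sth->execute;

while ( my ( $url, $mtime, $content ) = $sth->fetchrow_array ) {
    # prog input: headers, blank line, then exactly Content-Length bytes
    print "Path-Name: $url\n";
    print "Content-Length: ", length($content), "\n";
    print "Last-Mtime: $mtime\n\n";
    print $content;
}
```

Then something like `swish-e -S prog -i ./index_cache.pl` would index the cache without re-fetching anything.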
> One issue I just came on is with a dynamic site
> that hosts several trade publications using a common app to provide
> content from each of the pubs. The URL
> mag1.com/bg.asp?manufacturer=15?mag=20 is the same as
> mag2.com/bg.asp?manufacturer=15?mag=20. The app only uses the arguments
> from the URL, not the domain name. For future crawls I'm pretty sure I
> can filter what I want only and crawl this site on its own. (I want
> mag=7) It appears I can do that with a callback.
You mean the two URLs return the same content? One solution is to enable
the MD5 check to filter out the duplicates, although it can be slower and
the checksums are stored in RAM.
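For example, the MD5 check is enabled per server in the spider config. A minimal sketch (the base URL is taken from the post; the file name and email are placeholders):

```perl
# In the spider config file (e.g. SwishSpiderConfig.pl, name illustrative).
# use_md5 tells spider.pl to checksum each fetched document and skip any
# whose content matches one already seen -- checksums live in RAM.
@servers = (
    {
        base_url => 'http://mag1.com/',
        email    => 'admin@example.com',   # placeholder contact address
        use_md5  => 1,
    },
);
```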
If you mean a different query string returns the same document, then in
test_url you can modify the URI object (i.e. set the query string). The
test_url function is called before spider.pl checks whether it has seen
the URL before.
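A hedged sketch of such a test_url callback, using the mag=7 filter from the post; the host names and the canonicalization step are assumptions for illustration:

```perl
# In the spider config: keep only the publication we want (mag=7) and
# fold the duplicate domains onto one host so the same app is not
# crawled once per magazine site. Host names here are made up.
test_url => sub {
    my ( $uri, $server ) = @_;

    my %q = $uri->query_form;     # parse ?manufacturer=15&mag=7 etc.
    return 0 if defined $q{mag} && $q{mag} != 7;   # skip other pubs

    # Treat mag1.com, mag2.com, ... as the same site (assumed pattern)
    $uri->host('mag1.com') if $uri->host =~ /^mag\d+\.com$/;

    return 1;   # spider.pl then checks if it has seen this URL
},
```

Because the host rewrite happens before the seen-URL check, the second domain's copy of each page is recognized as already visited.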
> The issue here is that this crawl is about four days old and has about a
> dozen other sites in the index. The prop.temp file and the .temp index
> file are being written. If I kill this crawl by terminating spider.pl, is
> there any way convert those .temp files left by the terminated crawl to
> usable indices?
Yes. Send the spider a SIGHUP and it will stop spidering, and swish-e
will then write out the index.
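For example, from another shell (the pgrep lookup is just one way to find the spider's PID):

```
kill -HUP `pgrep -f spider.pl`
```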
> Why isn't there a Swish-e O'Reily book? ;-)
I've been wondering what to do with my free time. Oh, wait, I was
saving that for sleep.
Received on Sun Dec 7 02:16:53 2003