On Tue, 23 Jul 2002, Khalid Shukri wrote:
> I have a rather weird problem with swish-e:
> I'm trying to index a lot of sites (about 45000), using the prog method with
> the spider.pl (on a DSL line ) included in the windows binary distribution,
> but I want maximally 3 pages from each site.
Hum.. might be a good idea to cache those pages locally so the spider can
check last-modified dates on the next run. 45,000 sites and 3 pages is a
fair number of docs. Nice that you updated your hardware.
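
As a rough illustration of the caching idea, here is a minimal Python sketch of a conditional fetch. It is not part of spider.pl; the function names and the idea of storing a Unix timestamp per cached page are assumptions made for the example. The request carries an If-Modified-Since header, so a server that supports it can answer 304 and the page does not have to be downloaded again.

```python
import email.utils
import urllib.error
import urllib.request


def build_conditional_request(url, cached_mtime=None):
    """Return a GET request that asks for the page only if it changed.

    cached_mtime is a Unix timestamp saved from the previous run, or None
    when the page has never been fetched (hypothetical cache layout).
    """
    request = urllib.request.Request(url)
    if cached_mtime is not None:
        # HTTP dates are always expressed in GMT.
        request.add_header(
            "If-Modified-Since",
            email.utils.formatdate(cached_mtime, usegmt=True),
        )
    return request


def fetch_if_changed(url, cached_mtime=None, timeout=30):
    """Return (body, last_modified_header), or (None, None) if unchanged."""
    request = build_conditional_request(url, cached_mtime)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep using the cached copy
            return None, None
        raise
```

Not every server honours If-Modified-Since, so a cache built this way should still cope with receiving the full page every time.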
> Then I got my brand new p4 with 2 Giga
> RAM and 1 GHz CPU .-) on which I installed Debian. I then tried to search my
> old indexes from the windows machine, but swish-e always crashed on certain
> search words. (This is the second problem: either the index files of the Windows
> version are different from the Linux version, or there's a bug in the linux
Yes, the Windows binary lags the development version in CVS, so I suspect
that you are seeing some changes in the index format. Next time we make a
Windows binary we will also provide the associated source snapshot.
> So, my idea is the following: swish-e seems to open a file handle (or pipe? to
> the spider?) each time it's moving to the next url, but fails to close it
> properly afterwards.
No, it doesn't work that way. Swish-e opens a pipe to spider.pl one time,
and then spider.pl just writes the docs one after another to stdout.
The spider used to be recursive, but that had memory problems with some
versions of LWP, so now extracted (spidered) links are just added to a
list and requested one-by-one. Sockets should be closed after each
request (unless the keep-alive feature is in use), and in any case should
be closed before moving on to a different host.
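
To make the flow above concrete, here is a minimal Python sketch of a non-recursive crawl loop that writes documents one after another to a single output stream. It is not the spider.pl code. The `fetch` and `extract_links` callables and the `max_per_host` limit are hypothetical names for the example, and the `Path-Name:` / `Content-Length:` headers are my understanding of what the prog input method expects, so check them against the swish-e documentation before relying on them.

```python
import sys
from collections import deque
from urllib.parse import urlsplit


def crawl(start_urls, fetch, extract_links, max_per_host=3, out=None):
    """Visit URLs from a work list, writing each document to `out`.

    fetch(url) -> bytes or None; extract_links(url, body) -> iterable of URLs.
    Both are hypothetical callables supplied by the caller.
    """
    out = out if out is not None else sys.stdout.buffer
    start_urls = list(start_urls)
    queue = deque(start_urls)      # work list instead of recursion
    seen = set(start_urls)         # URLs already queued at some point
    per_host = {}                  # host -> number of pages written so far
    while queue:
        url = queue.popleft()
        host = urlsplit(url).netloc
        if per_host.get(host, 0) >= max_per_host:
            continue               # this host already reached its limit
        body = fetch(url)
        if body is None:
            continue               # fetch failed or page was skipped
        per_host[host] = per_host.get(host, 0) + 1
        # Assumed header layout: headers, a blank line, then the content.
        header = "Path-Name: %s\nContent-Length: %d\n\n" % (url, len(body))
        out.write(header.encode("utf-8") + body)
        for link in extract_links(url, body):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    out.flush()
    return per_host
```

The point of the sketch is that nothing is opened per URL on the indexer side: there is one output stream for the whole run, and only the fetching code touches sockets.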
Luckily, you are running Linux, so you can use lsof to see which files the
process is actually holding open.
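
For example, `lsof -p <pid>` lists the open files of one process. If you would rather check from a script, the following Python sketch reads the same kind of information from /proc. It is Linux-only, and the function name is just an assumption for the example.

```python
import os


def open_descriptors(pid="self"):
    """Map each open file descriptor of `pid` to what it points at (Linux only)."""
    fd_dir = "/proc/%s/fd" % pid
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return result
```

Calling this with the spider's process ID a few times during a run shows whether the number of entries (sockets show up as `socket:[...]`) keeps growing.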
The spider is really designed to index one host at a time. How do you
have it configured to spider 45,000 different sites?
Bill Moseley firstname.lastname@example.org
Received on Tue Jul 23 15:39:24 2002