On Tue, 23 Jul 2002 you wrote:
> On Tue, 23 Jul 2002, Khalid Shukri wrote:
> > I have a rather weird problem with swish-e:
> > I'm trying to index a lot of sites (about 45000) over a DSL line, using the
> > prog method with the spider.pl included in the windows binary distribution,
> > but I want at most 3 pages from each site.
> Hum.. might be a good idea to cache those pages locally so the spider can
> check last-modified dates on the next run. 45,000 sites and 3 pages is a
> fair number of docs. Nice that you updated your hardware.
I guess I should do that, although spidering became really fast after I
followed Oscar Miro's system tuning advice. Did the whole job in about 5 hours.
> > Then I got my brand new p4 with 2 Giga
> > RAM and 1 GHz CPU .-) on which I installed Debian. I then tried to search my
> > old indexes from the windows machine, but swish-e always crashed on certain
> > search words. (This is the second problem: Either the index files of the windows
> > version are different from those of the linux version, or there's a bug in the
> > linux version).
> Yes, the windows binary lags the development version in CVS, so I suspect
> that you are seeing some changes in the index format. Next time we make a
> windows binary we will also provide the associated source snapshot.
> > So, my idea is the following: swish-e seems to open a file handle (or pipe? to
> > the spider?) each time it's moving to the next url, but fails to close it
> > properly afterwards.
> No, it doesn't work that way. Swish-e opens a pipe to spider.pl one time,
> and then spider.pl just writes the docs one after another to stdout.
> The spider used to be recursive, but that had memory problems with some
> versions of LWP, so now extracted (spidered) links are just added to a
> list and requested one-by-one. Sockets should be closed after each
> request (unless using the keep-alive feature), but in any case should be
> closed after each different host.
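
To illustrate the non-recursive approach described above, here is a minimal
sketch in Python. It is not the code in spider.pl: the names crawl_site,
fetch_links and max_pages are made up for this example, and fetch_links stands
in for whatever actually downloads a page and returns its links. Extracted
links go onto a list and are requested one by one, staying on one host and
stopping after a fixed number of pages.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl_site(base_url, fetch_links, max_pages=3):
    """Visit at most max_pages pages on base_url's host, without recursion.

    fetch_links is a hypothetical callable: it takes a URL and returns the
    links found on that page (absolute or relative).
    """
    host = urlparse(base_url).netloc
    queue = deque([base_url])   # links waiting to be requested
    seen = {base_url}           # links already queued, to avoid repeats
    visited = []                # pages actually requested, in order

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            absolute = urljoin(url, link)
            # Stay on the same host and skip anything already queued.
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Because the pending links live in an explicit queue rather than on the call
stack, memory use depends on the number of queued links, not on how deep the
site's link structure goes.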
> Luckily, you are running linux so you can use lsof to see what files the
> process is indeed holding open.
They seem to be mostly TCP sockets. However, I reach a peak of about 17000 open
handles. Wonder what's happening there!
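
For watching how the handle count develops during a run, a rough alternative
to reading lsof output by hand is to tally the entries under /proc/<pid>/fd.
The sketch below assumes Linux with /proc mounted; the function names are made
up for this example. Unlike lsof, it only tells sockets from pipes and regular
files and cannot say whether a socket is TCP.

```python
import os
from collections import Counter


def classify_fd_target(target):
    """Map a /proc/<pid>/fd symlink target to a coarse descriptor kind."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    return "file"


def count_open_fds(pid="self"):
    """Count the open descriptors of a process, grouped by kind (Linux only)."""
    counts = Counter()
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        counts[classify_fd_target(target)] += 1
    return counts
```

Calling count_open_fds with the process id of the running spider every few
seconds would show whether the socket count keeps growing or levels off.
Reading another user's process this way needs the same permissions lsof does.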
> The spider is really designed to index one host at a time. How do you
> have it configured to spider 45,000 different sites?
I just put a big array of servers in MyConfig.pl
> Bill Moseley email@example.com
Received on Wed Jul 31 19:56:10 2002