
Re: Re: swish fails to close file handles/pipes with

From: Khalid Shukri <khalid(at)not-real.einblick.de>
Date: Wed Jul 31 2002 - 19:52:39 GMT
On Tue, 23 Jul 2002, you wrote:
> On Tue, 23 Jul 2002, Khalid Shukri wrote:
> 
> > I have a rather weird problem with swish-e:
> > I'm trying to index a lot of sites (about 45,000) using the prog method with
> > the spider.pl (on a DSL line) included in the windows binary distribution,
> > but I want at most 3 pages from each site.
> 
> Hum.. might be a good idea to cache those pages locally so the spider can
> check last-modified dates on the next run.  45,000 sites and 3 pages is a
> fair number of docs.  Nice that you updated your hardware.
> 
I guess I should do that, although spidering became really fast after I
followed Oscar Miro's system tuning advice. It did the whole job in about 5 hours.
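
For the local cache idea, something along these lines would let a later run skip
pages that have not changed. This is only a rough sketch, not part of spider.pl:
the cache directory, file naming and the fetch_cached() helper are made up, and
LWP's mirror() is used because it sends If-Modified-Since based on the cached
file's mtime.

#!/usr/bin/perl
# Rough sketch of a local page cache keyed on the URL, so a later run can
# send If-Modified-Since instead of refetching everything.  The cache
# layout and the fetch_cached() helper are hypothetical; LWP's mirror()
# returns 304 when the page has not changed and leaves the cached copy alone.
use strict;
use warnings;
use LWP::UserAgent;
use Digest::MD5 qw(md5_hex);

my $cache_dir = './page_cache';
mkdir $cache_dir unless -d $cache_dir;

my $ua = LWP::UserAgent->new( timeout => 30 );

sub fetch_cached {
    my ($url) = @_;
    my $file = "$cache_dir/" . md5_hex($url) . '.html';
    my $res  = $ua->mirror( $url, $file );   # 304 means the cached file is current
    return $res->code == 304 ? 'cached'
         : $res->is_success  ? 'updated'
         :                     'error: ' . $res->status_line;
}

print fetch_cached('http://www.example.com/'), "\n";

If I remember the spider right, its test_response or filter_content callbacks
would be the place to hook something like this in.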

> 
> 
> > Then I got my brand new P4 with 2 GB of
> > RAM and a 1 GHz CPU :-) on which I installed Debian. I then tried to search my
> > old indexes from the windows machine, but swish-e always crashed on certain
> > search words. (This is the second problem: either the index files of the windows
> > version are different from the linux version, or there's a bug in the linux
> > version.)
> 
> Yes, the windows binary lags the development version in CVS, so I suspect
> that you are seeing some changes in the index format. Next time we make a
> windows binary we will also provide the associated source snapshot.
> 
> 
> > So, my idea is the following: swish-e seems to open a file handle (or pipe? to
> > the spider?) each time it's moving to the next URL, but fails to close it
> > properly afterwards.
> 
> No, it doesn't work that way.  Swish-e opens a pipe to spider.pl one time,
> and then spider.pl just writes the docs one after another to stdout.
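
(A minimal sketch of that single-pipe model, for reference: the spider side just
prints a short header block followed by the body for each document, and swish-e
reads them all from the one pipe. The header names are from memory of the
prog-method docs, and the URL and content below are made up.)

#!/usr/bin/perl
# Sketch of the "prog" output format: one header block plus body per document,
# all written to stdout over the single pipe that swish-e opened.
use strict;
use warnings;

sub output_doc {
    my ( $uri, $content ) = @_;
    print "Path-Name: $uri\n";
    print "Content-Length: ", length($content), "\n";
    print "Document-Type: HTML*\n";
    print "\n";                 # blank line separates headers from the body
    print $content;
}

output_doc( 'http://www.example.com/index.html',
            "<html><body>hello world</body></html>\n" );
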
> 
> The spider used to be recursive, but that had memory problems with some
> versions of LWP, so now extracted (spidered) links are just added to a
> list and requested one-by-one.  Sockets should be closed after each
> request (unless using the keep-alive feature), but in any case should be
> closed after each different host.
> 
> Luckily, you are running linux so you can use lsof to see what files the
> process is indeed holding open.

They seem to be mostly TCP sockets. However, I reach a peak of about 17,000 open
handles. I wonder what's happening there!
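
If it helps to watch that number over time without grepping lsof output, the
descriptors can also be counted straight from /proc on Linux. This little helper
is hypothetical, not part of swish-e or spider.pl:

#!/usr/bin/perl
# Count open descriptors for a pid by listing /proc/<pid>/fd, which is
# roughly what lsof reads on Linux anyway.
use strict;
use warnings;

my $pid = shift or die "usage: $0 <pid>\n";
opendir my $dh, "/proc/$pid/fd" or die "cannot read /proc/$pid/fd: $!\n";
# Everything in this directory except '.' and '..' is an open descriptor.
my $count = grep { !/^\.\.?$/ } readdir $dh;
closedir $dh;
print "$count open descriptors for pid $pid\n";

Running it in a loop while the spider works through the list should show whether
the count climbs with every request or only when it moves to a new host.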


> 
> The spider is designed to really index one host at a time.  How do you
> have it configured to spider 45,000 different sites?
> 
I just put a big array of servers in MyConfig.pl.
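
Roughly like this, for reference. The @servers array is what spider.pl reads; the
parameter names used here (base_url, email, agent, max_files, keep_alive) are from
memory of its documentation and should be checked against your copy, and the site
list and email address are placeholders:

# MyConfig.pl -- sketch of a "big array of servers" config for spider.pl.
use strict;
use vars qw(@servers);   # spider.pl reads this array after loading the config

# One entry per site, read from a plain list of start URLs.
open my $fh, '<', 'sites.txt' or die "cannot open sites.txt: $!";
my @sites = <$fh>;
chomp @sites;
close $fh;

for my $url (@sites) {
    push @servers, {
        base_url   => $url,
        email      => 'admin@example.com',    # placeholder contact address
        agent      => 'swish-e spider',
        max_files  => 3,                      # stop after three pages per site
        keep_alive => 1,                      # reuse the socket for those pages
    };
}

1;   # config files must return a true value

If I remember the setup right, swish-e's config then points spider.pl at this file
via SwishProgParameters, the same way as with the stock SwishSpiderConfig.pl.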

> 
> 
> -- 
> Bill Moseley moseley@hank.org

Bye
Khalid
Received on Wed Jul 31 19:56:10 2002