
Re: swish fails to close file handles /pipes with prog method

From: Khalid Shukri <khalid(at)>
Date: Wed Jul 31 2002 - 19:38:58 GMT
On Wed, 21 Aug 2002, you wrote:
> Khalid Shukri wrote:
> > I have a rather weird problem with swish-e:
> > I'm trying to index a lot of sites (about 45000) using the prog method with
> > the spider (on a DSL line) included in the windows binary distribution,
> > but I want at most 3 pages from each site. I tried to do this on an old PII
> > with 64MB RAM running windows 2000. It started well (although slow), but became
> > slower and slower while working through the 45000 urls, and, what's more, after
> > some time it started reporting "Skipped" for about every url. I thought this
> > might be a problem with insufficient memory, swapping etc. I then divided the
> > whole set into chunks of 1000 which I indexed separately. This worked
> > reasonably well, although still slow. Then I got my brand new P4 with 2 GB of
> > RAM and a 1 GHz CPU :-) on which I installed Debian. I then tried to search my
> > old indexes from the windows machine, but swish-e always crashed on certain
> > search words. (This is the second problem: either the index files of the
> > windows version are different from those of the linux version, or there's a
> > bug in the linux version.) I then indexed again, and on my new supercomputer
> > the same thing happened as on the old windows machine. I put an
> > "open (LOG,file); print LOG something; close LOG;" in the test_url callback
> > routine of the spider to find out what's happening, but at a certain point
> > the program stopped writing anything to the file, saying "Can't write to
> > closed file handle". I then tried again to do the indexing in chunks of 1000,
> > but this time started all 45 processes in parallel. After some time, I tried
> > to open one of the log files to see what's happening, but got the error:
> > "Too many open files".
> > So, my idea is the following: swish-e seems to open a file handle (or a pipe
> > to the spider?) each time it moves to the next url, but fails to close it
> > properly afterwards.
> > Any help/suggestions available?
> > Thanks in advance
> > Khalida
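
A minimal sketch of that kind of per-URL logging in the test_url callback,
assuming spider.pl hands the callback the URI object and the server hash (the
log file path here is just an example, not the one from the original setup):

    # Sketch only -- not the exact code from the original config.
    sub test_url {
        my ( $uri, $server ) = @_;

        # Lexical filehandle, append mode, closed before returning,
        # so no handle is left open between URLs.
        open( my $log, '>>', '/tmp/test_url.log' )
            or die "Can't open log file: $!";
        print $log "testing $uri\n";
        close $log
            or warn "Can't close log file: $!";

        return 1;    # return true so the spider fetches this URL
    }

Because the filehandle is opened and closed inside the callback itself, the
logging cannot be what leaves handles open between URLs.
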
> Apparently, you have too many open files in the system. This is a very common
> problem when opening lots of sockets (spidering). You start opening sockets,
> writing downloaded contents to files, and suddenly you've run out of file
> descriptors. The default value (in RedHat Linux) is 4096. This is certainly a
> low value for spidering and indexing at the same time.
> You can change the number of file descriptors system-wide. The exact method
> depends on which distribution you are using... If you want to know what your
> limit is, just type:
> cat /proc/sys/fs/file-max
> and to see how many file descriptors are actually in use:
> watch cat /proc/sys/fs/file-nr
> (You must be aware that... in order for the file limit changes to take effect
> you must also change the i-node limit, typically 3-4 times the file descriptor
> limit.)
> To change your limits, please visit:
> (at the end of the page there's a "file descriptors" section)
> and
> (search for the string "Increasing the Maximum number of file handles and the
> inode cache")
> seems like your problem is very common and easy to solve....
Thanks a lot. I changed file-max and inode-max, and also tcp_keepalive_time,
and now everything works fine.
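
For anyone who wants to watch the numbers Oscar mentions from a script rather
than the shell, a quick Perl sketch along these lines should do (it assumes a
2.4-era /proc layout, where file-nr reports allocated, free and maximum
handles):

    #!/usr/bin/perl -w
    use strict;

    # Read how many file handles are allocated and what the limit is.
    # On 2.4-era kernels /proc/sys/fs/file-nr holds: allocated free maximum.
    open( my $nr, '<', '/proc/sys/fs/file-nr' )
        or die "Can't read file-nr: $!";
    my $line = <$nr>;
    close $nr;
    my ( $allocated, $free ) = split ' ', $line;

    open( my $fm, '<', '/proc/sys/fs/file-max' )
        or die "Can't read file-max: $!";
    chomp( my $file_max = <$fm> );
    close $fm;

    print "file handles: $allocated allocated, $free free, limit $file_max\n";

Raising the limits themselves is then a matter of writing larger values into
/proc/sys/fs/file-max (and, on older kernels, /proc/sys/fs/inode-max), as the
pages mentioned above describe.
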
> hope that helps and if so, please let me know... i'm about to write a more
> down-to-earth linux tuning guide
I'd appreciate that.

> bye,
> Oscar Marin Miro
> (by the way... i like your name!!)

You'd change your mind if you could hear me sing. It makes even my cats run away!
What about your paintings?
Received on Wed Jul 31 19:42:34 2002