
Re: Re: swish fails to close file handles /pipes

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sun Aug 04 2002 - 17:49:17 GMT
At 09:20 AM 08/04/02 -0700, Khalid Shukri wrote:
>> For your situation using a spider that fetches URLs in parallel might be a
>> good thing.
>I just started 45 swish-e/spider processes at the same time, which worked
>quite well, but if you can recommend a good parallel spider I might try that
>the next time.

You might check out a column by Randal Schwartz:

  http://www.stonehenge.com/merlyn/LinuxMag/col16.html

Then, for more Perl, there are modules like:

http://search-beta.cpan.org/author/MARCLANG/ParallelUserAgent-2.54/lib/LWP/Parallel/RobotUA.pm

Randal used that module in http://www.stonehenge.com/merlyn/WebTechniques/col27.html, although I think he later had reasons for not using it (reasons I cannot remember at this time).
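
For what it's worth, a minimal sketch of using that module looks something
like the following.  The URLs, agent name, email address, and the 10/30-second
timeouts are all just placeholders, not anything from the spider:

  #!/usr/bin/perl -w
  use strict;

  use LWP::Parallel::RobotUA;
  use HTTP::Request;

  # Robot UAs want an agent name and a contact address (made up here).
  my $ua = LWP::Parallel::RobotUA->new( 'my-spider/0.1', 'me@example.com' );
  $ua->delay( 5/60 );          # be polite: 5 seconds between hits on one host
  $ua->timeout( 10 );          # give up on a stalled host after 10 seconds

  # Register the requests; they are then fetched in parallel.
  $ua->register( HTTP::Request->new( GET => $_ ) )
      for qw( http://example.com/a.html http://example.com/b.html );

  # wait() blocks until everything is done (or 30 seconds pass) and
  # returns a hash of entries, each holding its HTTP::Response.
  my $entries = $ua->wait( 30 );

  for my $entry ( values %$entries ) {
      my $res = $entry->response;
      print $res->request->uri, ' => ', $res->code, "\n";
  }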

Or, if you already know what you want to fetch and want more control, there's:

http://search-beta.cpan.org/author/DLUX/Parallel-ForkManager-0.7.4/ForkManager.pm
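
Roughly, that one forks off a child per URL (the URL list and the limit of
10 children below are only examples):

  #!/usr/bin/perl -w
  use strict;

  use Parallel::ForkManager;
  use LWP::Simple qw( getstore );

  my @urls = qw( http://example.com/a.html http://example.com/b.html );

  # Never more than 10 children fetching at once.
  my $pm = Parallel::ForkManager->new( 10 );

  for my $url ( @urls ) {
      $pm->start and next;        # parent: move on to the next URL

      # Child: fetch one doc into a local file and exit.
      ( my $file = $url ) =~ s{[^\w.]+}{_}g;
      my $status = getstore( $url, $file );
      warn "$url => $status\n";

      $pm->finish;
  }
  $pm->wait_all_children;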

If you are only fetching three docs from each site then KeepAlives will be of less importance.  Regardless, a timeout of six minutes seems a bit much; I'd want to abort if the docs don't arrive within a few seconds.  Also, if you reject a document in the test_response() callback, the connection will be broken.  That's because test_response() is called after the first chunk of data is returned from the remote host, not after the entire doc has been fetched.  I suppose a chunked fetch could solve that problem.
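
Just to illustrate the timeout and content-check ideas with plain LWP (this
is not the spider's own config; the URL, the five-second timeout, and the
text/* test are only examples):

  #!/usr/bin/perl -w
  use strict;

  use LWP::UserAgent;
  use HTTP::Request;

  my $ua = LWP::UserAgent->new;
  $ua->timeout( 5 );   # abort if the server stalls for more than a few seconds

  my $res = $ua->request(
      HTTP::Request->new( GET => 'http://example.com/doc.html' ) );

  # Roughly the kind of thing a test_response() check looks at: skip
  # anything that isn't text rather than indexing images or tarballs.
  if ( $res->is_success && $res->content_type =~ m{^text/} ) {
      print "would index ", $res->request->uri, "\n";
  }
  else {
      print "skipping: ", $res->status_line, "\n";
  }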


-- 
Bill Moseley
mailto:moseley@hank.org
Received on Sun Aug 4 17:52:49 2002