At 09:20 AM 08/04/02 -0700, Khalid Shukri wrote:
>> For your situation using a spider that fetches URLs in parallel might be a
>> good thing.
>I just started 45 swish-e/spider processes at the same time, which worked
>quite well, but if you can recommend a good parallel spider i might try that
>the next time.
You might check out a column by Randal Schwartz:
Then, for more Perl, there are modules like:
Randal used that module in http://www.stonehenge.com/merlyn/WebTechniques/col27.html, although I think he later had reasons for not using it (reasons I can't recall at the moment).
Or, if you already know what you want to fetch and want more control.
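For what it's worth, here's a minimal sketch of parallel fetching with LWP::Parallel::UserAgent (the module Randal's column used). It assumes the module is installed from CPAN, and @urls is a hypothetical list of URLs you want to fetch:

```perl
use strict;
use warnings;
use LWP::Parallel::UserAgent;
use HTTP::Request;

my @urls = qw( http://example.com/a.html http://example.com/b.html );

my $pua = LWP::Parallel::UserAgent->new;
$pua->timeout(10);     # per-connection timeout in seconds
$pua->redirect(1);     # follow redirects
$pua->max_hosts(5);    # fetch from up to 5 hosts at once
$pua->max_req(5);      # up to 5 parallel requests per host

# Register all requests up front; register() returns an error
# response if a request could not be queued.
for my $url (@urls) {
    if ( my $err = $pua->register( HTTP::Request->new( GET => $url ) ) ) {
        warn "failed to register $url\n";
    }
}

# wait() blocks until everything is done (or 15 seconds pass)
# and returns a hashref of entries, one per request.
my $entries = $pua->wait(15);
for my $key ( keys %$entries ) {
    my $res = $entries->{$key}->response;
    print $res->request->url, " => ", $res->code, "\n";
}
```

The key difference from forking 45 separate spider processes is that one process multiplexes all the connections, so you control total parallelism with a couple of knobs instead of your process table.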
If you are only fetching three docs from each site, then KeepAlives will matter less. Regardless, a timeout of six minutes seems a bit much; I'd want to abort if the docs don't arrive within a few seconds. Also, if you reject a document in the test_response() callback, the connection will be broken. That's because test_response() is called after the first chunk of data is returned from the remote host, not after the entire doc has been fetched. I suppose a chunked fetch could solve that problem.
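To make that concrete, here's a sketch of a spider.pl server config with a short timeout and a test_response() callback. The key names (max_wait_time, keep_alive, test_response) are what I believe the stock swish-e spider.pl uses, but check the spider.pl documentation for your version; the URL and email are placeholders:

```perl
# SwishSpiderConfig.pl -- a sketch, not a drop-in config.
@servers = (
    {
        base_url      => 'http://example.com/',
        email         => 'admin@example.com',
        keep_alive    => 1,   # less of a win when fetching only a few docs per site
        max_wait_time => 10,  # give up after 10 seconds, not six minutes

        # Called after the FIRST chunk of data arrives, not after the
        # whole doc is fetched -- so returning false here aborts the
        # transfer and breaks the connection (killing the keep-alive).
        test_response => sub {
            my ( $uri, $server, $response ) = @_;
            return $response->content_type eq 'text/html';
        },
    },
);
1;
```

That early-abort behavior is exactly the trade-off described above: rejecting in test_response() saves bandwidth on the rejected doc but costs you the persistent connection.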
Received on Sun Aug 4 17:52:49 2002