Re: request delay problem with spider.pl

From: Aliasgar Dahodwala <g_adahodwala(at)not-real.umassd.edu>
Date: Tue Jul 05 2005 - 19:20:25 GMT
My MaxKeepAliveRequests is set to the default of 100 in Apache.

I agree with you. From what I figured from the spider.pl code, the spider 
should put in a delay of delay_sec between two connection requests. I 
logged the debug messages and saw the "sleeping 5 seconds" message after 
approximately every 100 requests, which is fine.

What I failed to find out is why the spider sleeps for something around 
5000 x delay_sec after fetching 5824 files (that is the exact count). In 
the debug file I have that many "sleeping 5 seconds" messages before the 
spider starts fetching again.

So I am thinking there is a bug in there somewhere.

Regards
Aliasgar.

Bill Moseley wrote:

>On Tue, Jul 05, 2005 at 09:13:08AM -0700, Aliasgar Dahodwala wrote:
>  
>
>>I am running swish-e 2.4.3 on a Red Hat Linux box. I am using the 
>>included spider.pl script to spider my website.
>>
>>My problem: when I enable the keep_alive directive of the spider program 
>>and set delay_sec to 5, the spider fetches pages at blazing speed, 
>>ignoring the delay_sec directive. After going through around 5000 pages 
>>it then catches up on all the delay: it stops fetching any more pages 
>>and just keeps sleeping for 5 seconds at a time. After a long wait it 
>>continues from where it left off.
>>    
>>
>
>Sounds like a bug.  By design it ignores the delay_sec setting in a
>keep alive connection.  The point of the keep alive is to allow faster
>requests -- avoiding the time required to start up the new connection.
>
>From the docs:
>
># delay_sec
>
>    This optional key sets the delay in seconds to wait between
>    requests.  See the LWP::RobotUA man page for more information. The
>    default is 5 seconds. Set to zero for no delay.
>
>    When using the keep_alive feature (recommended) the delay will be
>    used only where the previous request returned a "Connection:
>    closed" header.
>
>
>So after fetching 5000 docs (is your MaxKeepAliveRequests set to
>5000?) you are saying that the spider delays delay_sec seconds x 5000
>before it fetches any more documents?
>
>Let's see, the wait time is set here:
>
>    my $wait = $server->{delay_sec} - ( time - $server->{last_response_time} );
>    return unless $wait > 0;
>    sleep( $wait );
>
>That last_response_time is the time the last request was completed,
>which should normally be almost the same as the current time, so you
>end up with delay_sec.  So I don't see how it could be delaying more
>than delay_sec.
>
>Is that what you mean?
>
>  
>
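
The wait computation quoted above can be tried in isolation. What follows is only a minimal sketch, not code from spider.pl: the helper name compute_wait and its explicit $now argument are hypothetical, introduced here so the arithmetic can be checked without actually sleeping. It suggests that a single call never waits longer than delay_sec (as long as last_response_time is not in the future), so a long stall would have to come from many separate sleeps, which would be consistent with the many "sleeping 5 seconds" messages in the debug log.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of spider.pl): same arithmetic as the quoted
# snippet, but returns the number of seconds to wait instead of sleeping.
# Returns 0 when no wait is needed.
sub compute_wait {
    my ( $delay_sec, $last_response_time, $now ) = @_;
    my $wait = $delay_sec - ( $now - $last_response_time );
    return $wait > 0 ? $wait : 0;
}

my $now = time;
printf "response just completed: wait %d\n", compute_wait( 5, $now,      $now );
printf "response 2 seconds ago:  wait %d\n", compute_wait( 5, $now - 2,  $now );
printf "response 10 seconds ago: wait %d\n", compute_wait( 5, $now - 10, $now );
```

With delay_sec set to 5, the three calls give waits of 5, 3 and 0 seconds respectively.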

Received on Tue Jul 5 12:20:30 2005