
Re: Server or document limit on spider.pl

From: Bill Moseley <moseley@hank.org>
Date: Wed Jan 07 2004 - 16:36:19 GMT
On Wed, Jan 07, 2004 at 07:45:45AM -0800, Ander wrote:
> Hi all:
> 
> I'm using spider.pl to index a list of servers, which I create
> dynamically (from a database). When we have 2500 documents indexed
> (more or less), spidering (and indexing, of course) stops.

I can't think of an obvious cause.  You might be able to enable some of 
the debugging options to watch the progress, but if the spider is 
quitting for what it considers a normal reason it won't report anything.

I'll show a patch below that prints the number of links left in the 
queue (and disables the default 5-second delay between requests).

Run as (adjust for your shell):

$ SPIDER_DEBUG=url,links ./spider.pl default http://localhost >/dev/null 2>spider.out


Is it possible that it's running out of links to follow?
Is it possible that the spider is eating memory and the process is being 
killed by process limits?
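
To check the memory theory, here is a rough sketch of a sub you could 
paste into spider.pl and call right next to the print statement in the 
patch below.  It is not part of the patch, the name report_memory is 
just my placeholder, it is Linux-only, and it assumes /proc is mounted; 
it dumps the process's VmSize/VmRSS lines so you can watch for growth:

# Rough sketch (Linux-only): print the spider's memory footprint.
# Call report_memory() next to the "Links left in array" line added
# by the patch below.
sub report_memory {
    my $pid = $$;    # current process id
    if ( open my $fh, '<', "/proc/$pid/status" ) {
        while ( my $line = <$fh> ) {
            print STDERR $line if $line =~ /^Vm(?:Size|RSS):/;
        }
        close $fh;
    }
}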

--- /usr/local/lib/swish-e/spider.pl    2003-12-13 14:13:03.000000000 -0800
+++ spider.pl   2004-01-07 08:24:07.000000000 -0800
@@ -247,9 +247,9 @@
             $server->{delay_sec} = int ($server->{delay_min} * 60);
         }
         
-        $server->{delay_sec} = 5 unless defined $server->{delay_sec};
+        $server->{delay_sec} = 0 unless defined $server->{delay_sec};
     }
-    $server->{delay_sec} = 5 unless $server->{delay_sec} =~ /^\d+$/;
+    $server->{delay_sec} = 0 unless $server->{delay_sec} =~ /^\d+$/;
     
 
     if ( $server->{ignore_robots_file} ) {
@@ -395,6 +395,7 @@
         die $server->{abort} if $abort || $server->{abort};
 
         my ( $uri, $parent, $depth ) = @{shift @link_array};
+print STDERR "Links left in array = " . scalar @link_array . "\n";
         
         delay_request( $server );
         
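
Once you have a run, you can pull the queue-size lines back out of 
spider.out, for example:

$ grep 'Links left in array' spider.out | tail

If the count drops to zero right before the spider exits, it simply ran 
out of links to follow; if it is still large when the output stops, 
then memory or process limits look more likely.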


-- 
Bill Moseley
moseley@hank.org