
Re: Indexing stops early

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sat Nov 19 2005 - 17:18:54 GMT
On Sat, Nov 19, 2005 at 08:45:47AM -0800, David Chisholm wrote:
> I used spider.pl to crawl about 800 sites to a text file and now I am 
> trying to index it via the prog method. It seems to give up after ~500 
> pages indexed even though I am sure there are many thousands. It's as if 
> it's interpreting something as a stop directive, but I am not sure what. 
> Is there something I can look for in my prog output?

I can't think of anything.  Just turn on debugging and see if there's
something in the last page fetched that makes it stop.
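
For spider.pl the quickest way is probably the SPIDER_DEBUG environment
variable (there is also a debug option in the spider config).  The flag
names below are from memory, so run "perldoc spider.pl" for the list your
version actually supports:

    # Flag names from memory; confirm against "perldoc spider.pl".
    # Debug chatter goes to stderr, so the document stream on stdout stays clean.
    SPIDER_DEBUG=url,links,failed ./spider.pl SwishSpiderConfig.pl > output.txt 2> spider.log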

There are max_time, max_files, and max_indexed settings, but none of them
is set by default.
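
If one of those did end up in your config it would be in the server hash,
something like this (the values here are made up, just to show the keys):

    # Hypothetical example: none of these limits apply unless you set them yourself.
    @servers = (
        {
            base_url    => 'http://example.com/',
            email       => 'you@example.com',
            max_time    => 60,      # give up after 60 minutes of spidering
            max_files   => 5000,    # give up after fetching 5000 files
            max_indexed => 5000,    # give up after passing 5000 files to swish
        },
    );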

Normally it only gives up when it runs out of links.  Maybe you could have
it print a count of how many links are still in the queue at the start of
every request, something like this in spider.pl's main loop:

    while ( @link_array ) {

        die $server->{abort} if $abort || $server->{abort};

        my ( $uri, $parent, $depth ) = @{shift @link_array};

        # added: report the URI about to be fetched and how many links remain queued
        warn "About to spider $uri with ",
            scalar @link_array,
            " items left to spider\n";

-- 
Bill Moseley
moseley@hank.org
