On Sat, Nov 19, 2005 at 08:45:47AM -0800, David Chisholm wrote:
> I used spider.pl to crawl about 800 sites to a text file and now I am
> trying to index it via the prog method. It seems to give up after ~500
> pages indexed even though I am sure there are many thousand. It's as if
> it's interpreting something as a stop directive but I am not sure what.
> Is there something I can look for in my prog output?

I can't think of anything. Turn on debugging and check whether something
in the last page fetched makes it stop.
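
Since the input here is a saved text file rather than a live crawl, one way
to find the "last page" is to walk the file the way a -S prog reader would.
The sketch below is a hypothetical helper, not part of swish-e. It assumes
each document is a block of "Name: value" headers that includes Path-Name
and Content-Length, then a blank line, then exactly Content-Length bytes of
body. Check that assumption against the swish-e prog documentation before
trusting the result.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count the complete documents in a saved prog stream
# and report why parsing stopped, if it stopped before end of file.
# Returns ( count, last complete Path-Name, problem or undef ).
sub scan_prog_stream {
    my ($fh) = @_;
    my ( $count, $last_path ) = ( 0, '(none)' );

    while (1) {
        my %header;
        while ( defined( my $line = <$fh> ) ) {
            $line =~ s/\r?\n\z//;
            if ( $line eq '' ) {
                last if %header;    # blank line closes the header block
                next;               # tolerate blank lines between documents
            }
            return ( $count, $last_path, "unexpected header line: $line" )
                unless $line =~ /^([\w-]+):\s*(.*)$/;
            $header{ lc $1 } = $2;
        }
        return ( $count, $last_path, undef ) unless %header;    # clean EOF

        my $path   = $header{'path-name'};
        my $length = $header{'content-length'};
        return ( $count, $last_path,
            'header lacks Path-Name or a numeric Content-Length' )
            unless defined $path && defined $length && $length =~ /^\d+$/;

        # The body must be exactly Content-Length bytes long.
        my $got = read( $fh, my $body, $length );
        return ( $count, $last_path, "short body for $path" )
            unless defined $got && $got == $length;

        $count++;
        $last_path = $path;
    }
}

# Run as: perl scan_prog.pl spider-output.txt   (file name is an example)
if ( my $file = shift @ARGV ) {
    open my $fh, '<', $file or die "Cannot open $file: $!\n";
    binmode $fh;
    my ( $count, $last_path, $problem ) = scan_prog_stream($fh);
    print "$count complete documents; last one: $last_path\n";
    print "Stopped early: $problem\n" if defined $problem;
}
```

If the count comes out near 500, the document right after the last complete
one is the place to look, for example for a Content-Length that does not
match the actual body size.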

There are max_time, max_files, and max_indexed settings, but they should
not be set by default.
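
To rule those limits out, look for them in the spider config file. The
fragment below is only a sketch of what to search for: the base_url and
email values are placeholders, the numbers are made up, and the comments on
each key reflect my reading of the spider.pl documentation, so verify them
there.

```perl
# Sketch of a spider.pl config (e.g. SwishSpiderConfig.pl); values are
# placeholders, not taken from the original poster's setup.
@servers = (
    {
        base_url => 'http://www.example.com/',    # placeholder
        email    => 'admin@example.com',          # placeholder

        # If any of these keys is present, the spider can stop early.
        # Leaving them out should mean no limit.
        # max_time    => 30,     # assumed: stop after this many minutes
        # max_files   => 500,    # assumed: stop after this many requests
        # max_indexed => 500,    # assumed: stop after this many indexed files
    },
);

1;
```

A limit of 500 on max_files or max_indexed would match the symptom described
above, so those two are worth checking first.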

It would normally give up only when it runs out of links. Maybe you
could have it print a count of the number of links in the queue at the
start of every request:

    while ( @link_array ) {
        die $server->{abort} if $abort || $server->{abort};
        my ( $uri, $parent, $depth ) = @{ shift @link_array };

        # Added for debugging: report the queue size before each request.
        warn "About to spider $uri with ",
            scalar @link_array,
            " items left to spider\n";

--
Bill Moseley
moseley@hank.org
Received on Sat Nov 19 09:18:55 2005