
Spider.pl problem(s) on Linux (and other UNIXen?)

From: Greg Fenton <greg_fenton(at)not-real.yahoo.com>
Date: Fri May 23 2003 - 17:50:32 GMT
SWISH-E 2.2.3 on RedHat 7.3

A co-worker of mine was trying to crawl a website very, very slowly (to
limit impact on a production website).

He set in his spider.conf:

    base_url => "http://production_webserver",
    delay_min => 0.5,

and has standard test_url and test_response subroutines.
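
For context, a minimal spider.conf along these lines might look like
this (the test_url and test_response bodies below are illustrative
placeholders, not his actual code):

    # sketch of a minimal spider.conf; the two callback bodies are
    # placeholders, not the real ones from his config
    @servers = (
        {
            base_url      => "http://production_webserver",
            delay_min     => 0.5,    # minutes between requests (= 30 s)
            test_url      => sub {
                my $uri = shift;
                return $uri->path =~ /\.html?$/;    # follow HTML pages only
            },
            test_response => sub {
                my ( $uri, $server, $response ) = @_;
                return $response->content_type eq 'text/html';
            },
        },
    );
    1;    # a spider config file must return true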

The config works just fine on his W2K workstation, but we wanted to run
it on a real system, so we put it on one of our Linux servers.  The
crawl failed with a Perl error:

  is_success not defined on line 477 of spider.pl

Digging through it, I came to the conclusion that this is a problem
with the alarm() code [which does not fire on Win32 platforms, hence
no error on his workstation].  The hard-coded defaults for two
different parameters are 30 seconds, which is exactly what our
delay_min of 0.5 minutes works out to.
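
My rough mental model of the failure, as a sketch of the usual
eval/alarm pattern rather than the actual code at line 477 of
spider.pl:

    # sketch only: how a SIGALRM can leave $response undefined
    use LWP::UserAgent;
    my $ua            = LWP::UserAgent->new;
    my $url           = 'http://production_webserver/';
    my $max_wait_time = 30;     # hard-coded default
    my $delay_min     = 0.5;    # minutes, from spider.conf

    my $response;
    eval {
        local $SIG{ALRM} = sub { die "timed out\n" };
        alarm( $max_wait_time );
        sleep( $delay_min * 60 );       # the delay eats the whole budget
        $response = $ua->get( $url );   # never completes before the alarm
        alarm( 0 );
    };
    print $response->is_success;        # undef here -> the crash we saw

If the inter-request delay is counted inside the alarm window like
this, a delay_min of 0.5 guarantees a timeout every single time.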

By adding the following to our server configuration, we were able to
successfully crawl:

    max_wait_time => 60,
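
In other words (assuming the delay really is charged against the
alarm window), the two settings have to satisfy:

    delay_min * 60 < max_wait_time    # delay_min is in minutes,
                                      # max_wait_time in seconds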

There are two issues here, then:

1. The documentation for delay_min should reference max_wait_time.
2. In the event that an alarm does go off, the code currently
   crashes.  It would be nice if there were at least a message
   indicating the source of the error and possibly a
   suggestion of how to resolve it.  It would also be nice
   to be able to configure how such alarms are handled
   (ON_ALARM_EXIT, ON_ALARM_RETRY, ON_ALARM_SKIP_URL, etc.);
   see the sketch below.
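
Roughly what I am imagining (a hypothetical on_alarm setting and a
hypothetical fetch_with_alarm helper; none of this exists in
spider.pl today):

    # hypothetical config entry, in the style of the existing ones:
    on_alarm => 'skip_url',    # or 'retry' / 'exit'

    # and in the fetch loop, instead of crashing:
    URL: foreach my $uri ( @urls ) {
        my $response = eval { fetch_with_alarm( $uri ) };
        if ( $@ && $@ =~ /timed out/ ) {
            warn "$uri exceeded max_wait_time; see the delay_min docs\n";
            next URL if $server->{on_alarm} eq 'skip_url';
            redo URL if $server->{on_alarm} eq 'retry';   # needs a retry cap
            die "exiting on alarm\n";                     # the 'exit' behavior
        }
        # ... process $response ...
    }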

I could look into adding such an enhancement if desired, though, as
always, time may be an issue.  I don't want to go off working on this
code if someone else "owns" development of the spider (or if the 2.2.3
spider.pl code is going away in a future release).

=====
Greg Fenton
greg_fenton@yahoo.com

Received on Fri May 23 17:50:42 2003