suggestions for spider.pl

From: Michael <michael(at)not-real.insulin-pumpers.org>
Date: Thu Aug 01 2002 - 18:37:40 GMT
I've made some minor changes to "spider.pl" that I've found very 
useful, so I thought I'd pass them on. Basically, I converted 
"spider.pl" to a Perl module, turning the in-line code into a subroutine 
and adding a simple "sub init" to clear the variables for each pass. 
This allows the script to be used basically "as-is" by calling 

use lib qw(./);
use Swish::Spider;

Swish::Spider::init;
Swish::Spider::run_spider;

from within a small script. The advantages come when you write your 
own "steering" routine to replace "run_spider" with something different.
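As a rough illustration of what such a steering routine could look like, here is a minimal sketch. The name "my_run_spider" and the example URLs are made up for illustration, and the small Swish::Spider package inside the block is only a stand-in so the sketch runs on its own; with the converted Spider.pm you would "use Swish::Spider;" instead and call its real init and process_server.

```perl
use strict;
use warnings;

# Stand-in for the converted module (an assumption for this sketch only).
# It mimics the two entry points used in this post: init and process_server.
{
    package Swish::Spider;
    our %visited;
    sub init { %visited = () }
    sub process_server {
        my ($s) = @_;
        $visited{ $s->{base_url} } = 1;
        print "spidering $s->{base_url}\n";
    }
}

# Hypothetical steering routine: one clean pass per server.
sub my_run_spider {
    my @servers = @_;
    my $count = 0;
    for my $s (@servers) {
        Swish::Spider::init();              # clear state left by the previous pass
        Swish::Spider::process_server($s);  # spider this one server
        $count++;
    }
    return $count;                          # number of servers processed
}

my_run_spider(
    { base_url => 'http://www.example.com/' },
    { base_url => 'http://docs.example.com/' },
);
```

Whether init should be called once per server or once per index is a design choice; the sketch clears the state before every server.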

I run this from a Perl script that is similar to the original in-line 
code but includes more logic: it examines the last-modified date of the 
previous index, modifies the criteria for "test_url" for each site 
that is indexed, and optionally uses an individual config file for each 
site when it is spidered. This lets me pick up current versions of 
"spider.pl" with minimal changes as they are updated, yet preserve the 
code written for our spidering script.
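For the last-modified check, a minimal sketch of reading the previous index's timestamp might look like the following. It assumes the index is an ordinary file on disk; the helper names and the file name "index.swish-e" are made up for illustration, and how the result is fed into "test_url" is left out.

```perl
use strict;
use warnings;

# Return the modification time (epoch seconds) of an index file,
# or 0 when there is no previous index, so that everything counts as newer.
sub index_mtime {
    my ($path) = @_;
    my @st = stat($path);       # empty list if the file does not exist
    return @st ? $st[9] : 0;    # element 9 of stat() is the mtime
}

# Hypothetical helper: true when a document timestamp is newer than the index.
sub newer_than_index {
    my ($doc_epoch, $index_path) = @_;
    return $doc_epoch > index_mtime($index_path) ? 1 : 0;
}

my $since = index_mtime('index.swish-e');   # assumed index file name
print "previous index timestamp: $since\n";
```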

I've included some examples from our scripts.

In the spider.pl doc it says:

swish-e -S prog -c swish.config
perl spider.pl | swish-e -S prog -c swish.conf -i stdin

Our scripts do:

# somewhere else....
foreach($indx) {
  my $cf = $swishlib .'/'. $prefix . '.config';
# check for an individual config file
  $cf = $swishlib .'/'. $swconfig unless -e $cf;
  my $i = qq($swish $verbose -S prog -c $cf -i stdin -f $indx);
  local *SAVOUT;
  open SAVOUT, ">&STDOUT";
  open STDOUT, "|$i";

  foreach(@urls) {
    $servers{base_url} = $_;
    &Swish::Spider::process_server( $s );
  }
  close STDOUT;
  open STDOUT, ">&SAVOUT";
}
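The save/redirect/restore dance with STDOUT can also be wrapped in a helper. The following is only a sketch of that technique with error checks added: the name "with_stdout_piped_to" is made up, and "sort" stands in for the swish-e command line purely so the example runs without swish-e installed.

```perl
use strict;
use warnings;

# Run $producer with STDOUT temporarily redirected into the shell
# command $cmd, then put the original STDOUT back.
# Returns the child's wait status ($?).
sub with_stdout_piped_to {
    my ($cmd, $producer) = @_;

    # Duplicate the current STDOUT so it can be restored afterwards.
    open(my $saved, '>&', \*STDOUT) or die "cannot dup STDOUT: $!";

    # Reopen STDOUT as a pipe feeding the child command.
    open(STDOUT, '|-', $cmd) or die "cannot start '$cmd': $!";

    # Trap errors so STDOUT is restored even if the producer dies.
    my $ok  = eval { $producer->(); 1 };
    my $err = $@;

    # Closing the pipe flushes it and waits for the child to finish.
    close(STDOUT);
    my $status = $?;

    # Restore the original STDOUT.
    open(STDOUT, '>&', $saved) or die "cannot restore STDOUT: $!";
    close($saved);

    die $err unless $ok;
    return $status;
}

# Everything printed inside the callback goes to the child's stdin.
with_stdout_piped_to('sort', sub { print "b\n"; print "a\n" });
print "back on the original STDOUT\n";
```

In the spidering script the callback would be the loop over @urls that calls process_server, and $cmd would be the swish-e command built in $i.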


And in test_url:

# elements of @{$s->{must_match}} = one of:
#       Note: all values may contain 'regexps'
#  1)   rewrite url starts with '~' and is of the form
#       '~s/string_one[a-z]+/string_two/'
#  2)   URI must not contain (starts with '!')
#       '!SESSID'
#  3)   URI must contain
#       'some_path'

sub test_url {
  my ($url,$s) = @_;
# exclude images
  return 0 if ($url =~ m|/[a-zA-Z0-9\.\_\-]+\.($non_text)[\?#;]*|io); 
   
# special URL conditions
  foreach (@{$s->{must_match}}) {
    if ($_ =~ /^~/) {                   # re-write url
      my $exp = '$url->opaque($u) if $u =~ ' . $';
      my $u = $url->opaque;
      eval $exp;
    } elsif ($_ =~ /^!(.*)/) {          # must not contain
      return 0 if $url =~ /$1/;
    } else {                            # must contain
      return 0 unless $url =~ /$_/;
    }
  }
  return 1;
}
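To try the three rule types without a spider run, here is a minimal stand-alone sketch of the same dispatch. It works on a plain URL string instead of a URI object, and the name "apply_rules" and the example URL are made up for illustration. Note that the rewrite rule is run through a string eval, so rules must come from a trusted source.

```perl
use strict;
use warnings;

# Apply must_match-style rules to a plain URL string.
# Returns the (possibly rewritten) URL if it is accepted,
# or undef (empty list in list context) if a rule rejects it.
sub apply_rules {
    my ($url, @rules) = @_;
    for my $rule (@rules) {
        if ($rule =~ /^~(.*)/s) {            # rewrite, e.g. '~s/old/new/'
            my $subst = $1;
            # String eval executes the rule as Perl code: trusted rules only.
            my $ok = eval "\$url =~ $subst; 1";
            die "bad rewrite rule '$rule': $@" unless $ok;
        } elsif ($rule =~ /^!(.*)/s) {       # must not contain
            my $re = $1;
            return if $url =~ /$re/;
        } else {                             # must contain
            return unless $url =~ /$rule/;
        }
    }
    return $url;
}

my @rules = ('~s/string_one[a-z]+/string_two/', '!SESSID', 'some_path');
my $kept  = apply_rules('http://example.com/some_path/string_oneabc', @rules);
print defined $kept ? "keep: $kept\n" : "skip\n";
```

Rules are applied in order, so a rewrite placed first changes what the later "must contain" and "must not contain" checks see.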

DIFF for spider.pl
# diff Spider.pm spider.pl 
2d1
< package Swish::Spider;
66,83c65
< use vars qw(
< # global -- I suppose would be smarter to localize it per server. 
<       $abort 
<       %visited 
<       %validated 
<       %bad_links 
< ); 
< 
< sub init { 
<   $abort      = 0; 
<   %visited    = (); 
<   %validated  = (); 
<   %bad_links  = (); 
< } 
< 
< sub run_spider { 
< 
<    my @servers; 
--- 
>     use vars '@servers';
95a78 
>     my $abort;
97a81,85 
>     my %visited;  # global -- I suppose would be smarter to localize it per server. 
> 
>     my %validated; 
>     my %bad_links; 
> 
119c107 
< } 
--- 
> 

Hope this is of some use to others.
Michael@Insulin-Pumpers.org
Received on Thu Aug 1 18:41:17 2002