
Re: Limiting indexing

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Jul 03 2002 - 19:21:01 GMT
At 11:41 AM 07/03/02 -0700, Jody Cleveland wrote:
>Hi Bill,
>
>> Are you using spider.pl (-S prog) or the -S http method?
>
>swish-e -S prog -c spider.config

A bit of an overview:

spider.pl can use a configuration file.  It's SwishSpiderConfig.pl by
default, but can be specified in your spider.config file with the swish
directive SwishProgParameters.  

Or in more general terms, the swish configuration parameter
SwishProgParameters is used to define parameters that are passed to the
program run in -S prog mode, and the spider.pl program takes as its first
parameter the path to a configuration file.
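
For example, a minimal spider.config for -S prog mode might look like this
(an untested sketch; adjust the paths for your own setup):

    # spider.config -- passed to swish-e with -c
    IndexDir            ./spider.pl             # the program run in -S prog mode
    SwishProgParameters SwishSpiderConfig.pl    # first argument handed to spider.pl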

I'll assume you are using the default spider.pl config file,
SwishSpiderConfig.pl.

That file sets a Perl array.  Each element of that array is a hash
(a reference to a hash, really) that describes a web server to spider.  The
point of using an array of hashes is so that you can define more than one
server to spider, or spider different parts of the same web server.



Now, in perl a hash looks like this:

my %server_data = (
        base_url    => 'http://www.winefox.org/wals/index.html',
        agent       => 'swish-e spider http://swish-e.org/',
        email       => 'your@emailaddress',

        # limit to only .html files
        test_url    => sub { $_[0]->path =~ /\.html?$/ },

        delay_min   => .0001,     # Delay in minutes between requests
        max_files   => 100,       # Max Unique URLs to spider

);

Note that each entry ends with a comma.  The arrow thingy in Perl, =>, is
basically a comma, but it quotes the preceding word.  Keys must be unique
-- you can't have two entries with the same key.

%hash = (
     this => 100,
     this => 200,
);

Here the second entry wins, so the value stored for "this" ends up as 200.

And then we need an array of hashes (although normally it's just an array
with one element), so that's:

  @servers = ( \%server_data );

That's defining an array @servers that contains one hash (well, one
reference to a hash).
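
If you did want to spider more than one area, you'd just put more hash
references in the array.  Something like this (untested; the second URL is
made up for illustration):

    # copy the settings above, overriding base_url for each area
    my %wals_area  = ( %server_data, base_url => 'http://www.winnefox.org/wals/index.html' );
    my %other_area = ( %server_data, base_url => 'http://www.winnefox.org/other/index.html' );

    @servers = ( \%wals_area, \%other_area );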

Now, the above hash assigns values to keys.

        max_files   => 100,       # Max Unique URLs to spider

sets the hash value of max_files to 100.
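
(Just to show what that means in Perl terms, the spider can later read that
setting back with something like

    print $server_data{max_files};   # prints 100

so the hash is really just a set of named options.)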


This is a bit different:

        # limit to only .html files
        test_url    => sub { $_[0]->path =~ /\.html?$/ },

assigns a subroutine (a code reference) as the value of test_url.

The "$_[0]" parameter is the first parameter passed to that function, and
as described in the spider.pl docs, it is a URI object.  Type "perldoc URI"
at your local prompt for info on URI objects.
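
For example (the methods here are straight from the URI docs):

    use URI;
    my $uri = URI->new('http://www.winnefox.org/wals/index.html');
    print $uri->host, "\n";   # www.winnefox.org
    print $uri->path, "\n";   # /wals/index.html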

That subroutine is called every time a URL is parsed out of a spidered web
page.  If the call-back function returns true, then the URL is added to the
list of URLs to spider.  If it returns false then it is not added to the
list. 

So the example above only returns true if the path part of the URL ends in
".htm" or ".html".


So with the URI object you can test the path, which is what I assume you
want to do.  Say you want to index winnefox.org/wals/*, that is, only URLs
whose path starts with "/wals".  I'd do something like:

[I did not test any of these examples.]

    test_url => sub {
        my $uri = shift;
        return $uri->path =~ m[^/wals/];
    },

Might as well make things clear:

    test_url => sub {
        my $uri = shift;
        if ( $uri->path =~ m[^/wals/] ) {
            warn "Indexing $uri\n";
            return 1;
        } else {
            warn "Skipping $uri\n";
            return 0;
        }
    },
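
And if you want both restrictions at once -- only .htm/.html files, and only
under /wals/ -- you can combine the tests:

    test_url => sub {
        my $uri = shift;
        return $uri->path =~ m[^/wals/] && $uri->path =~ /\.html?$/;
    },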

Does that help?



-- 
Bill Moseley
mailto:moseley@hank.org
Received on Wed Jul 3 19:24:31 2002