Re: limit in large IncludeConfigFile

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sun Nov 17 2002 - 18:15:05 GMT
At 09:41 AM 11/17/02 -0800, Leif Larsson wrote:
>SwishProgParameters default http://bim.ce.kth.se http://www.svktf.se
>etc... etc... (lots of sites)

>Bad directive on line #2 of file sites.txt: /www.euronom.se http://www.e
>urovac.nu etc... etc...
>
>Seems to me there is a 2000 character limit. As soon as the included
>"sites.txt" file grows over this limit, spider.pl barfs on me.
>
>Am i missing something ?

Nope, there's a line length limit in the config file.

Instead, use a config file for the spider.  Since it's Perl, it gives you a
lot of options.

One way would be to do this in a spider config file (not tested, but
probably not too far off...)


# Define list of servers to spider
my @server_list = qw(
   http://bim.ce.kth.se
   http://www.svktf.se
   ...
);

# Content types to accept.  @content_types is used by test_response below
# but was not defined in the original post; these values are an assumption.
my @content_types = qw( text/html text/plain );

@servers = (
   {
        base_url        => \@server_list,
        email           => 'leif.larsson@l3system.se',
        delay_min       => .0001,
        link_tags       => [qw/ a frame /],
        test_url        => sub { $_[0]->path !~ /\.(?:gif|jpeg|png)$/i },

        test_response   => sub {
            my $content_type = $_[2]->content_type;
            my $ok = grep { $_ eq $content_type } @content_types;
            return 1 if $ok;
            print STDERR "$_[0] $content_type != (@content_types)\n";
            return;
        },
     
    },
);
1;

In case you want different settings for each host, you might be better off
doing something like:


# Define list of servers to spider
my @server_list = qw(
   http://bim.ce.kth.se
   http://www.svktf.se
   ...
);

# Content types to accept.  @content_types is used by test_response below
# but was not defined in the original post; these values are an assumption.
my @content_types = qw( text/html text/plain );


my %spider_config = (
    email           => 'leif.larsson@l3system.se',
    delay_min       => .0001,
    link_tags       => [qw/ a frame /],
    test_url        => sub { $_[0]->path !~ /\.(?:gif|jpeg|png)$/i },

    test_response   => sub {
        my $content_type = $_[2]->content_type;
        my $ok = grep { $_ eq $content_type } @content_types;
        return 1 if $ok;
        print STDERR "$_[0] $content_type != (@content_types)\n";
        return;
    },
);

for ( @server_list ) {
    my %this_host = %spider_config;
    $this_host{base_url} = $_;
    # maybe set "same_hosts" settings for each server

    push @servers, \%this_host;
}
    
1;
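
To make the "same_hosts" comment in the loop above a bit more concrete, here is a minimal untested sketch. It assumes a hypothetical %overrides hash keyed by base URL; the alias host in it is made up for illustration, and the shared settings are trimmed to two keys to keep it short.

```perl
use strict;
use warnings;

our @servers;    # the list the spider config file fills in

my @server_list = qw( http://bim.ce.kth.se http://www.svktf.se );

# Settings shared by every host (shortened for this sketch).
my %spider_config = (
    email     => 'leif.larsson@l3system.se',
    delay_min => .0001,
);

# Hypothetical per-host settings; the alias below is illustrative only.
my %overrides = (
    'http://www.svktf.se' => { same_hosts => ['svktf.se'] },
);

for my $url (@server_list) {
    # Shared defaults first, then base_url, then any host-specific keys.
    my %this_host = (
        %spider_config,
        base_url => $url,
        %{ $overrides{$url} || {} },
    );
    push @servers, \%this_host;
}

1;
```

Hosts without an entry in %overrides just get the shared defaults, so the override table only has to list the exceptions.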

Then you can do things like get the server list from a file (seems silly to
have another file, though) or a database or whatever.
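
A rough sketch of the file variant, not tested against a real spider run. It assumes a plain text file with one URL per line, so no single line comes near the length limit; the helper name read_server_list is hypothetical and not part of spider.pl.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of spider.pl): return the URLs listed in
# a text file, one per line, skipping blank lines and "#" comment lines.
sub read_server_list {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my @urls;
    while ( my $line = <$fh> ) {
        next if $line =~ /^\s*#/;    # whole-line comment
        $line =~ s/^\s+|\s+$//g;     # trim whitespace, including the newline
        push @urls, $line if length $line;
    }
    close $fh;
    return @urls;
}
```

In the spider config the qw() list would then become something like "my @server_list = read_server_list('sites.txt');".  If the config file itself runs under "use strict", declaring the list with "our @servers;" first should keep strict happy.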

-- 
Bill Moseley
mailto:moseley@hank.org
Received on Sun Nov 17 18:15:16 2002