spidering with swish

From: Bill Moseley <moseley(at)>
Date: Sun Apr 08 2001 - 23:28:09 GMT
I still have hope that someone will help out and test the new spider
program.  I'd like it to get some testing before it is released in a new
version of swish.

So, here's the output from a test run.  I don't know whether this seems fast
or not, since I've never spidered with swish 1.3.2.  Does this look
interesting enough to make you want to help out and test on your site?

(The per-second figures don't really mean much -- each one is just the count
divided by the total time to complete.)

../ Summary for: http://localhost/
DOC transformed:        13  (0.2/sec)
     Duplicates:     3,763  (63.8/sec)
 Off-site links:     2,750  (46.6/sec)
PDF transformed:        22  (0.4/sec)
        Skipped:     3,766  (63.8/sec)
    Total Bytes: 3,961,551  (67144.9/sec)
    Unique URLs:       657  (11.1/sec)
     robots.txt:        62  (1.1/sec)

22326 unique words indexed.
Writing file index...
543 files indexed.
Running time: 1 minute, 5 seconds.
Indexing done!
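
For anyone checking the per-second figures in the summary, here is a minimal sketch of the arithmetic (count divided by elapsed time). The 59-second elapsed time is an assumption inferred from the numbers themselves (3,763 / 63.8 is roughly 59), since the summary does not print it, and `rate` is a hypothetical helper, not something taken from the spider program.

```perl
use strict;
use warnings;

# Assumed elapsed spider time in seconds, inferred from the summary figures;
# it is not printed anywhere in the output.
my $elapsed = 59;

# A few counts copied from the summary.
my %counts = (
    'Duplicates'  => 3_763,
    'Unique URLs' => 657,
    'Total Bytes' => 3_961_551,
);

# Hypothetical helper: items per second, formatted to one decimal place.
sub rate {
    my ($count, $seconds) = @_;
    return '0.0' unless $seconds;    # avoid dividing by zero
    return sprintf '%.1f', $count / $seconds;
}

for my $name (sort keys %counts) {
    printf "%15s: %9d  (%s/sec)\n",
        $name, $counts{$name}, rate($counts{$name}, $elapsed);
}
```

Under that assumption the computed rates line up with the ones in the summary, which suggests the rates are based on the spidering time rather than the full 1 minute, 5 seconds reported by swish.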

The Unique URLs figure is just the count of files requested from the server.
Some of those are rejected because of their content type.

And here's the spider config that created this:

@servers = (
        base_url        => 'http://localhost/',
        email           => '',
        delay_min       => .0001,
        link_tags       => [qw/ a frame /],

        test_url        => sub { $_[0]->path !~ /\.(?:gif|jpeg)$/ },
        filter_content  => [ \&pdf, \&doc ],

        test_response   => sub {
            my $content_type = $_[2]->content_type;
            my $ok = grep { $_ eq $content_type } 
              qw{ text/html text/plain application/pdf application/msword };
            return 1 if $ok;
            print STDERR "$_[0] wrong content type ( $content_type )\n";

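To see what the test_url pattern in the config lets through, here is a small sketch that applies the same regex to plain path strings. Using plain strings rather than the URI object the spider passes in is an assumption made to keep the example dependency-free; `wanted_path` and the sample paths are hypothetical.

```perl
use strict;
use warnings;

# Same pattern as the test_url callback in the config, applied to a plain
# path string (the real callback calls ->path on its first argument).
sub wanted_path {
    my ($path) = @_;
    return $path !~ /\.(?:gif|jpeg)$/ ? 1 : 0;
}

# Hypothetical sample paths.
for my $path (qw{ /index.html /logo.gif /photo.jpeg /photo.jpg /docs/manual.pdf }) {
    printf "%-18s %s\n", $path, wanted_path($path) ? 'fetch' : 'skip';
}
```

Only paths ending in .gif or .jpeg are skipped before the request is made. Something like /photo.jpg would still be requested and would have to be turned away later by the content-type check in test_response.
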
Bill Moseley
Received on Sun Apr 8 23:29:25 2001