Bug in spider.pl setting no_index from test_url on base_url fails

From: John P. Rouillard <rouilj(at)not-real.cs.umb.edu>
Date: Tue Aug 13 2002 - 22:23:27 GMT

It looks like there is a bug in spider.pl: setting the no_index
attribute for the base_url from within the test_url function has no
effect.

I don't want the base_url page indexed, since all the useful info on
that page is also in the articles underneath it. If it is indexed,
though, the article summaries it shows often give it a higher score
than the page with the real info.

I think the bug is in line 378 of process_link. You are resetting the
no_index key to zero, but you have already called test_url once while
processing the base_url, so whatever test_url set for that page is
lost. I added a test so that the lines:

    # Really should just subclass the response object!
    $server->{no_contents} = 0;
    $server->{no_index} = 0;
    $server->{no_spider} = 0;

now read:

    # Really should just subclass the response object!
    $server->{no_contents} = 0 if $server->{counts}{'Unique URLs'} > 1;
    $server->{no_index}    = 0 if $server->{counts}{'Unique URLs'} > 1;
    $server->{no_spider}   = 0 if $server->{counts}{'Unique URLs'} > 1;
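To show the effect in isolation, here is a minimal sketch. It is not
the real spider.pl code: no_index_after_reset() is a made-up helper,
and it assumes the 'Unique URLs' count is 1 while the base_url is being
processed, which is what the guard above relies on.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the reset step; not taken from spider.pl.
sub no_index_after_reset {
    my ($guarded) = @_;

    # Assumed state while the base_url is processed: one unique URL so far.
    my $server = { no_index => 0, counts => { 'Unique URLs' => 1 } };

    # test_url has already run for the base_url and set the flag.
    $server->{no_index}++;

    if ($guarded) {
        # Proposed change: leave the flag alone for the first URL.
        $server->{no_index} = 0 if $server->{counts}{'Unique URLs'} > 1;
    }
    else {
        # Original behaviour: unconditional reset.
        $server->{no_index} = 0;
    }

    return $server->{no_index};
}

print "unguarded reset: no_index = ", no_index_after_reset(0), "\n";
print "guarded reset:   no_index = ", no_index_after_reset(1), "\n";
```

With the unconditional reset the flag comes back as 0, so the base_url
would be indexed; with the guard it stays at 1.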

An example config entry is:

    {
        skip        => 0,  # skip spidering this server
        
        base_url    => 'some url here',
        agent       => 'swish-e spider http://swish-e.org/',
        email       => 'swish@domain.invalid',

        # limit to real articles
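        test_url    => sub {
            my ( $uri, $server ) = @_;    # lexical, not a global $server
            # spider the search page for links, but keep it out of the index
            $server->{no_index}++ if $uri->path =~ m#/search\.asp#;
            return $uri->path =~ m#/article\.asp$#
                || $uri->path =~ m#/search\.asp$#;
        },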
        delay_min   => .0001,     # Delay in minutes between requests
        max_time    => 60,        # Max time to spider in minutes
    },

Here the top-level page (the base_url) is a search/index page covering
all pages at the site.
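For anyone who wants to try the callback outside the spider, this is a
rough sketch of what it returns for a few paths. StubURI is a made-up
stand-in for the URI object that spider.pl passes in (only ->path is
modelled), the sample paths are invented, and check() is just a helper
for the demonstration.

```perl
use strict;
use warnings;

package StubURI;    # hypothetical; models only the ->path method

sub new  { my ( $class, $path ) = @_; return bless { path => $path }, $class }
sub path { return $_[0]{path} }

package main;

# Same logic as the test_url entry in the config above.
sub test_url {
    my ( $uri, $server ) = @_;
    $server->{no_index}++ if $uri->path =~ m#/search\.asp#;
    return ( $uri->path =~ m#/article\.asp$# || $uri->path =~ m#/search\.asp$# ) ? 1 : 0;
}

# Run the callback against one path with a fresh server hash.
sub check {
    my ($path) = @_;
    my $server = { no_index => 0 };
    my $follow = test_url( StubURI->new($path), $server );
    return "follow=$follow no_index=$server->{no_index}";
}

print "$_: ", check($_), "\n"
    for '/search.asp', '/news/article.asp', '/about.html';
```

The search page is followed but flagged no_index, article pages are
followed and indexed, and anything else is rejected.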

				-- rouilj
John Rouillard
===============================================================================
My employers don't acknowledge my existence much less my opinions.
Received on Tue Aug 13 22:27:07 2002