Re: Bug in spider.pl setting no_index from test_url on base_url fails

From: John P. Rouillard <rouilj(at)not-real.cs.umb.edu>
Date: Tue Aug 13 2002 - 23:27:51 GMT
In message <3.0.3.32.20020813153331.02be0fec@pop3.hank.org>,
Bill Moseley writes:
>At 03:22 PM 08/13/02 -0700, John P. Rouillard wrote:
>>
>>It looks like there is a bug in spider.pl. An attempt to set a no_index
>>attribute on the base_url using the test_url function fails.
>
>Well, it's not really a bug as the docs say:
>
>=item test_url
>
>...
>
>You cannot use the server flags:
>
>    no_contents
>    no_index
>    no_spider
>
>So, you need to set those in a test_response call-back.  test_url is a way
>to avoid fetching the document completely.  You need test_response since
>you don't want to index it, but you still want to follow links in that
>document.  Thus, you still need to fetch that doc.
>
>Does that help?

Ok. I just reread the document. In the printed copy I have, the
subsection heads are missing. However, the test_response portion of
the doc contains this example:

               test_url => sub {
                   my $server = $_[1];
                   $server->{no_index}++ if $_[0]->path =~ /private\.html$/;
                   return 1;
               },

that needs to be changed. This is in the spider.pl file packaged with
swish_e-2.1dev25.
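
If I read the explanation above correctly, the corrected example would
presumably become a test_response callback along these lines (untested;
this assumes test_response receives the URI and the server hash in the
same positions as test_url):

               test_response => sub {
                   my $server = $_[1];
                   # don't index this document, but still fetch it so
                   # that its links can be followed
                   $server->{no_index}++ if $_[0]->path =~ /private\.html$/;
                   return 1;
               },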

However, I don't understand why it shouldn't be possible in test_url.
It seems to make sense that you should be able to set any of these
parameters at the earliest point at which you can determine they are
needed. If I can decide based on the path alone, then I should be able
to set the flag there. In my case, there is nothing the server could
return that would tell me whether a document is dynamic content with a
short half-life, except for a robots noindex meta tag, which will be
acted on anyway.

The only server setting that really makes no sense is no_spider, which
could be handled more easily by returning 0 from the test_url function,
as sketched below.
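
For example, something along these lines should skip such a URL
outright, so it is never fetched, indexed, or spidered (just a sketch,
reusing the private.html path check from the example above):

               test_url => sub {
                   # returning false skips this URL entirely:
                   # it is not fetched, indexed, or spidered
                   return 0 if $_[0]->path =~ /private\.html$/;
                   return 1;
               },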

In any case, thanks for clearing that up. Also, in
SwishSpiderConfig.pl the entry with base_url =>
'http://www.infopeople.org/' has an invalid keep_alives hash key; it
should be keep_alive.
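
For reference, the corrected entry would presumably look something like
this (most of the settings omitted; the email value is only a
placeholder):

    @servers = (
        {
            base_url   => 'http://www.infopeople.org/',
            email      => 'someone@example.com',  # placeholder address
            keep_alive => 1,                      # was: keep_alives
        },
    );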

				-- rouilj
John Rouillard
===============================================================================
My employers don't acknowledge my existence much less my opinions.