
Re: Bug in spider.pl setting no_index from

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Aug 13 2002 - 23:55:12 GMT
At 04:27 PM 08/13/02 -0700, John P. Rouillard wrote:
>Ok. I just reread the document. In the printed copy I have, the
>subsection heads are missing. However, the test_response portion of
>the doc has this example:
>
>               test_url => sub {
>                   my $server = $_[1];
>                   $server->{no_index}++ if $_[0]->path =~ /private\.html$/;
>                   return 1;
>               },
>
>that needs to be changed. This is in the spider.pl file packaged with
>swish_e-2.1dev25.

Oh, yes.  Thanks, done.


>However, I don't understand why it shouldn't be done in test_url. It
>seems to make sense that you can set any of these parameters at the
>earliest point that you can determine if it needs to be set. If I can
>set it knowing only the path, then I should be able to set it there.

That's just the way it works.  First, setting it in test_url would mean
tracking extra data for every URL in the list of URLs to process.  When you
fetch a page, test_url gets called for every link extracted; test_response
gets called when the URL is actually fetched, and those two events may
happen at different times.

Second, like I said, saying "no_index" means you still need to fetch that
document, so you might as well set no_index when the doc is first fetched,
and that happens in test_response.  test_url is for when you know you will
never need to fetch a doc from the server and can determine that from the
URL alone.
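
For example, the split looks something like this (an untested sketch;
the base_url and the regexes are only for illustration, and the callback
args follow the test_url example quoted above):

    my %serverA = (
        base_url   => 'http://swish-e.org/',
        keep_alive => 1,

        # Called for every link extracted from a fetched page,
        # before any request is made.  Return 0 to skip the URL
        # entirely; it will never be fetched.
        test_url => sub {
            my $uri = shift;
            return 0 if $uri->path =~ /\.(gif|jpe?g|png)$/i;
            return 1;
        },

        # Called only after the URL is actually fetched, so this
        # is the place to set per-document flags like no_index.
        test_response => sub {
            my ( $uri, $server ) = @_;
            $server->{no_index}++ if $uri->path =~ /private\.html$/;
            return 1;   # still index (unless no_index) and spider
        },
    );

    @servers = ( \%serverA );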

>In my case, there is nothing the server could return that would allow
>me to determine if it is dynamic content with a short half-life
>except for a robots noindex meta command, which will be acted on anyway.

Not sure I follow.  Maybe I missed something in your last message, but you
need to spider your base_url to fetch the links to follow, even though you
don't want to index that file.  So you fetch the doc, tell the spider not
to index the contents, but still extract the links.
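
For example (again just a sketch):

    test_response => sub {
        my ( $uri, $server ) = @_;
        # Skip indexing the top page itself, but return 1 so the
        # spider still extracts and follows its links.
        $server->{no_index}++ if $uri->path eq '/';
        return 1;
    },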

>The only server setting that really makes no sense is no-spider which
>could be more easily handled by returning 0 from the test_url
>function.

No, no_spider says to ignore the links in the file just fetched, but go
ahead and index the contents.  The doc must be fetched from the server, so
this must be done in test_response.

Sure, you could do all that in test_url, but then you would need to track
that info in the list of URLs that are still pending.  It's just easier
(and uses less memory) to do it in test_response.  Plus, you can then check
the content-type and other HTTP headers.
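
For instance (a sketch; the /archive/ path is made up, and $response is
the HTTP::Response object passed as the third argument):

    test_response => sub {
        my ( $uri, $server, $response ) = @_;

        # Index the contents of this doc, but ignore its links.
        $server->{no_spider}++ if $uri->path =~ m[^/archive/];

        # Headers are only available here, not in test_url.
        return 0 unless $response->content_type eq 'text/html';

        return 1;
    },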

>In any case thanks for clearing that up. Also, in
>SwishSpiderConfig.pl the entry for base_url =>
>'http://www.infopeople.org/' has an invalid keep_alives hash key; it
>should be keep_alive.

Ah, thanks.  Should use swish-e.org, too, I suppose.

Any other comments or suggestions you have are more than welcome!  It's
very helpful to have someone try to figure this stuff out once in a while
and report back.

Thanks,



-- 
Bill Moseley
mailto:moseley@hank.org