Re: [swish-e] spider.pl, how not to spider file access protocol?

From: Erik van Duren <erik.van.duren(at)not-real.snow.nl>
Date: Fri Nov 02 2007 - 16:36:36 GMT

Thanks for your response,

Just using the scheme in test_url didn't do the job.
What I had already noticed is that the file: URLs never show up in the debugging
output. Following your pointer about the scheme, I looked in spider.pl and
finally got to the subroutine check_link. In this subroutine there is a block
that compares url->scheme with server->scheme. If these don't match, the URL is
validated, which causes the script to check the existence of the file (in the
case of the file scheme). This is exactly what I don't want.
After crudely changing this part of the script to skip validate_link if the
scheme is "file", spider.pl is functioning exactly as I wanted it to (at least
in my test environment).

$ diff spider.pl_org spider.pl
74a75
>     skip_scheme
1307c1308,1310
<         validate_link( $server, $u, $base ) if $server->{validate_links};
---
>         unless ( $u->scheme eq $server->{skip_scheme} ) {
>           validate_link( $server, $u, $base ) if $server->{validate_links};
>         }

And added to the config file:
     skip_scheme  => 'file',

Using an array for skip_scheme would be even better, but this works for me.
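
A rough sketch of what the array variant might look like. This is not part of
spider.pl and has not been tested against it; scheme_is_skipped is a made-up
helper name, and plain strings stand in for $u->scheme so the snippet runs on
its own:

```perl
use strict;
use warnings;

# Hypothetical helper (not in spider.pl): returns 1 when $scheme matches the
# skip_scheme setting, which may be a single string or an array reference.
sub scheme_is_skipped {
    my ( $scheme, $skip ) = @_;
    return 0 unless defined $scheme && defined $skip;

    # Accept both  skip_scheme => 'file'  and  skip_scheme => [ 'file', 'ftp' ]
    my @skip = ref($skip) eq 'ARRAY' ? @$skip : ($skip);

    # String comparison (eq), case-insensitive
    return ( grep { lc($scheme) eq lc($_) } @skip ) ? 1 : 0;
}

# Stand-in for the server config hash used by the spider
my $server = { skip_scheme => [ 'file', 'ftp' ] };

for my $scheme (qw( http file ftp )) {
    my $action = scheme_is_skipped( $scheme, $server->{skip_scheme} )
        ? 'skip validation'
        : 'validate';
    print "$scheme: $action\n";
}
```

In check_link the condition would then presumably become
unless ( scheme_is_skipped( $u->scheme, $server->{skip_scheme} ) ), with the
config file accepting either form of skip_scheme.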

Thanks for the pointer,

Erik.

Quoting Bill Moseley <moseley@hank.org>:

> On Fri, Nov 02, 2007 at 11:35:11AM +0100, Erik van Duren wrote:
> >     test_url    => sub { $_[0]->canonical !~ /file:\/\//i },
> > Without result however:
> 
> I didn't test this but I'd probably use:
> 
>     sub { $_[0]->scheme ne 'file' }
> 
> But test to make sure that $_[0]->scheme returns "file".
> 
> -- 
> Bill Moseley
> moseley@hank.org
> 
> Unsubscribe from or help with the swish-e list: 
>    http://swish-e.org/Discussion/
> 
> Help with Swish-e:
>    http://swish-e.org/current/docs
> 
> _______________________________________________
> Users mailing list
> Users@lists.swish-e.org
> http://lists.swish-e.org/listinfo/users
> 



Received on Fri Nov 2 12:36:37 2007