Re: [swish-e], how not to spider file access protocol?

From: Erik van Duren <erik.van.duren(at)>
Date: Fri Nov 02 2007 - 16:36:36 GMT
Thanks for your response,

Just using the scheme in test_url didn't do the job.
What I had already noticed is that the file: URLs never show up in the debugging
output. Using your pointer toward the scheme I looked in and finally
got to subroutine check_link. In this subroutine there is a block that
compares url->scheme with server->scheme. If these don't match, the URL is
validated, which makes the script check for the existence of the file (in the
case of the file scheme). This is exactly what I don't want.
After crudely changing this part of the script to skip validate_link if the
scheme is "file", it is functioning exactly as I wanted it to (at least
in my test environment).

$ diff spider.pl_org
>     skip_scheme
<         validate_link( $server, $u, $base ) if $server->{validate_links};
>         unless ( $u->scheme eq $server->{skip_scheme} ) {
>           validate_link( $server, $u, $base ) if $server->{validate_links};
>         }

And added to the config file:
     skip_scheme  => 'file',

Using an array for skip_scheme would be even better, but for me this works.
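
A rough sketch of how a list-valued skip_scheme could be handled, not tested
against the spider itself. The helper name scheme_is_skipped is hypothetical
and not part of the script; it assumes only that skip_scheme is stored in the
server hash as either a single string or an array reference:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the spider script): returns true when
# $scheme is listed in $server->{skip_scheme}. The setting may be a single
# string, as in the patch above, or an array reference of scheme names.
sub scheme_is_skipped {
    my ( $server, $scheme ) = @_;
    my $skip = $server->{skip_scheme};
    return 0 unless defined $skip && defined $scheme;

    my @schemes = ref($skip) eq 'ARRAY' ? @$skip : ($skip);

    # String comparison (eq), case-insensitive; returns the match count.
    return scalar( grep { lc($_) eq lc($scheme) } @schemes );
}

# Small demonstration with made-up config values.
my $server = { skip_scheme => [ 'file', 'mailto' ] };
for my $scheme (qw( http file MAILTO )) {
    my $action = scheme_is_skipped( $server, $scheme ) ? 'skipped' : 'validated';
    printf "%-6s %s\n", $scheme, $action;
}
```

In check_link the guard would then become something like
"unless ( scheme_is_skipped( $server, $u->scheme ) ) { ... }", and the config
entry could be written as skip_scheme => [ 'file', 'mailto' ].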

Thanks for the pointer,


Quoting Bill Moseley <>:

> On Fri, Nov 02, 2007 at 11:35:11AM +0100, Erik van Duren wrote:
> >     test_url    => sub { $_[0]->canonical !~ /file:\/\//i },
> > Without result however:
> I didn't test this but I'd probably use:
>     sub { $_[0]->scheme ne 'file' }
> But test to make sure that $_[0]->scheme returns "file".
> -- 
> Bill Moseley

Received on Fri Nov 2 12:36:37 2007