Re: Change queue URL in test_url

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue May 25 2004 - 19:18:47 GMT

On Tue, May 25, 2004 at 12:06:26PM -0700, Justin Tang wrote:
> Hi:
>   I was wondering if there is any way to change the URL that is about to be
> queued using a call back function in test_url.  Specifically, say if I have
> 
> www.mysite.com/page.html?query=value
> 
> to be placed in the queue, and I want to change it to
> 
> www.mysite.com/page.html
> 
> how can I change the URL that is being passed back?  Thanks!

I think so.  Try something like:

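sub remove_query {
    my ( $uri ) = @_;

    # Drop the query string, but only for this one page.
    $uri->query( undef )
        if $uri->path eq '/page.html';

    # Return true so the spider keeps the (modified) URL.
    return 1;
}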

Then, in your spider config:

    test_url => \&remove_query,

I think you can also specify more than one function like this, if you
need to:

    test_url => [ \&remove_query, \&other_subroutine ],

$uri is a URI object.  See perldoc URI for the ways you can manipulate it.
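
As a rough illustration of what a URI object lets you do, here is a small
standalone sketch.  The URL and its extra "sort" parameter are made up for
the example, and it assumes the URI module is installed (spider.pl needs it
anyway).  It reads a few parts of the URL, then removes a single query
parameter instead of the whole query string:

```perl
use strict;
use warnings;
use URI;

# Hypothetical URL, used only for illustration.
my $uri = URI->new( 'http://www.mysite.com/page.html?query=value&sort=asc' );

# Read individual components.
print $uri->host,  "\n";    # host name part
print $uri->path,  "\n";    # path without the query string
print $uri->query, "\n";    # raw query string

# Remove just one parameter and keep the rest.
my %params = $uri->query_form;    # key/value pairs from the query string
delete $params{query};
$uri->query_form( %params );

print $uri->as_string, "\n";
```

The same calls work inside a test_url callback, because the callback is
handed the URI object itself and any change you make to it sticks.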


Note that after test_url runs, spider.pl checks whether $uri->canonical
has been visited before.  So if you strip the query as above, every
/page.html?... link should end up as the same URL, and the page will
only be visited once.
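
To see why the page is only fetched once, here is a sketch you can run
outside the spider.  The %visited hash is only a stand-in for the spider's
own bookkeeping (it is not spider.pl's actual code), and the three links are
invented for the example:

```perl
use strict;
use warnings;
use URI;

# Same callback as above: drop the query string for /page.html only.
sub remove_query {
    my ( $uri ) = @_;
    $uri->query( undef )
        if $uri->path eq '/page.html';
    return 1;    # true means "keep this URL"
}

my %visited;    # hypothetical stand-in for the spider's visited list
my @fetched;    # URLs that would actually be requested

for my $link (
    'http://www.mysite.com/page.html?query=value',
    'http://www.mysite.com/page.html?query=other',
    'http://www.mysite.com/other.html?query=value',
) {
    my $uri = URI->new( $link );

    next unless remove_query( $uri );          # callback may rewrite $uri
    next if $visited{ $uri->canonical }++;     # skip URLs already seen

    push @fetched, $uri->as_string;
}

print "$_\n" for @fetched;
```

The two /page.html links collapse to the same URL once the query is gone, so
only the first one gets through; /other.html keeps its query string because
the callback leaves other paths alone.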



> 
> -Justin
> 
> 

-- 
Bill Moseley
moseley@hank.org
Received on Tue May 25 12:18:47 2004