On Tue, Aug 31, 2004 at 01:05:01PM -0700, Shaffer, Chris wrote:
> As far as my problem crawling the forums... I think I know what is
> going on... The session_id is changing occasionally, causing it to go
> in circles... Is there any way I can filter out something matching
> 'sid=....' from the end of the path before spider.pl decides whether or
> not it's crawled it yet?
Yes, the "test_url()" call-back function is called right before
checking if the URL has already been seen. The test_url() function is
passed the URI object (perldoc URI) and that can be modified.
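Untested, but maybe something like this in your spider config:

    test_url => sub {
        my ( $uri ) = @_;
        # Rebuild the query string without the session id
        my %params = $uri->query_form;
        delete $params{sid};
        # Pass a reference: an empty plain list would leave the query unchanged
        $uri->query_form( \%params );
        return 1;   # true: let the spider process this URL
    },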
The problem with that method (using a hash) is that you can't have
multiple parameters of the same name, so be careful. If you might
have parameters with multiple values, then look at the query_param
methods from URI::QueryParam instead, or use an array.
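
A rough sketch of the array approach, also untested against spider.pl itself. The helper name strip_sid and the example.com URL are made up for illustration; the only URI call assumed is query_form, which returns a flat key/value list and accepts an array reference when setting.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: drop every "sid" parameter while keeping the
# order of the other parameters and any repeated keys.
sub strip_sid {
    my ($uri) = @_;
    my @pairs = $uri->query_form;    # flat list: key, value, key, value, ...
    my @kept;
    while ( my ( $key, $value ) = splice( @pairs, 0, 2 ) ) {
        push @kept, $key, $value unless $key eq 'sid';
    }
    # An array reference is used so that an empty list clears the query
    $uri->query_form( \@kept );
    return $uri;
}

# Made-up forum URL with a repeated "f" parameter and a session id
my $uri = URI->new('http://example.com/forum/view.php?f=1&f=2&t=7&sid=abc123');
strip_sid($uri);
print "$uri\n";
```

Inside a test_url callback the same loop could run on the $uri that is passed in, followed by "return 1;".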
There's likely a better tool for dealing with query strings.
--
Bill Moseley
moseley@hank.org
Received on Tue Aug 31 16:56:20 2004