Re: Spidering phpBB

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Aug 31 2004 - 23:55:06 GMT
On Tue, Aug 31, 2004 at 01:05:01PM -0700, Shaffer, Chris wrote:
> As far as my problem crawling the forums...  I think I know what is
> going on...  The session_id is changing occasionally, causing it to go
> in circles...  Is there any way I can filter out something matching
> 'sid=....' from the end of the path before spider.pl decides whether or
> not it has crawled it yet?

Yes, the "test_url()" call-back function is called right before
checking if the URL has already been seen.  The test_url() function is
passed the URI object (perldoc URI) and that can be modified.

Untested, but maybe something like this in your spider config:


    test_url => sub {
        my ( $uri ) = @_;
        # Rebuild the query string without the session id.
        my %params = $uri->query_form;
        delete $params{sid};
        $uri->query_form( %params );
        return 1;    # true = go ahead and spider this URL
    },

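The problem with that method (using a hash) is that a hash can hold
only one value per key, so repeated parameters of the same name are
collapsed, and the parameter order is not preserved.  If you might
have parameters with multiple values then look at the query_param
methods from URI::QueryParam instead (for example
$uri->query_param_delete('sid')), or keep the pairs in an array.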
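
As a rough sketch of the array approach (equally untested against
spider.pl itself): query_form returns a flat key/value list, and it
also accepts an array reference, so the pairs can be filtered without
going through a hash.  The strip_sid() helper name and the example
URL are made up for illustration; they are not part of spider.pl.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper (not part of spider.pl): drop every "sid"
# parameter while keeping the order and any repeated keys of the rest.
sub strip_sid {
    my ($uri) = @_;
    my @pairs = $uri->query_form;    # flat list: key, value, key, value, ...
    my @kept;
    while ( my ( $key, $value ) = splice( @pairs, 0, 2 ) ) {
        next if $key eq 'sid';
        push @kept, $key, $value;
    }
    $uri->query_form( \@kept );      # array ref keeps order and duplicates
    return $uri;
}

# Made-up phpBB-style URL with a repeated "f" parameter.
my $uri = URI->new('http://example.com/viewtopic.php?t=5&f=1&f=2&sid=abc123');
strip_sid($uri);
print "$uri\n";    # prints http://example.com/viewtopic.php?t=5&f=1&f=2
```

Inside the spider config, the test_url callback would then presumably
just call strip_sid($uri) and return 1.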

There's likely a better tool for dealing with query strings.



-- 
Bill Moseley
moseley@hank.org

Received on Tue Aug 31 16:56:20 2004