Re:

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Jan 04 2006 - 17:28:04 GMT
Didn't we discuss this a month ago?

On Wed, Jan 04, 2006 at 07:47:39AM -0800, Chad Day wrote:
> sub test_url {
> 
>         my ($uri, $server) = @_ ;
> 
>         return if $uri->query =~ /PHPSESSID/;
> }

$uri->query won't always be defined for every url so you will see:

> Use of uninitialized value in pattern match (m//) at
> /usr/local/apache/htdocs/dev/components/com_swishesearch/spider.conf
> line 11.


You are also saying "return if", so you aren't telling it what to
return if it doesn't match.  You could write:

    return ($uri->query || '') =~ /PHPSESSID/;

which will return true if it matches and false if it doesn't match.
In test_url a true return means to index the file.  So, only URLs that
have PHPSESSID in their url will be indexed.

> Is there some sort of syntax with my test_url bit I'm missing?  I'm
> trying to ensure the same pages aren't indexed repeatedly due to
> changing PHPSESSID variables when spidering the site.

Oh, the above won't work then.

Have you considered using the "use_md5" option?  That will work unless
the pages have different content based on the session.
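For reference, a minimal sketch of where that option would go in the spider config. Only "use_md5" and "test_url" come from this thread; the base_url and email values are placeholders I made up, so check the spider.pl docs for the settings your version expects.

```perl
# Fragment of a spider.pl config file (a sketch, not a complete config).
@servers = (
    {
        base_url => 'http://localhost/index.php',   # placeholder
        email    => 'you@example.com',              # placeholder
        use_md5  => 1,    # skip documents whose content was already seen
    },
);

1;
```

With this set, two URLs that differ only in PHPSESSID but return identical content should be indexed once.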

You need to either remove the PHPSESSID from the URL (so that it is not
part of the URL that is used to check if it's been seen before), OR you
need to track the session id in your spider config, which would mean
extracting the session id from the URL, storing it, and only accepting
URLs that carry that same id.

Here are both examples.

First, just to remove the session id from the URL:

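    sub test_url {
        my $uri = shift;
        my %params = $uri->query_form;
        delete $params{PHPSESSID};
        # Pass a hash ref: with an empty list query_form only reads the
        # query, so a URL whose sole parameter was PHPSESSID would keep it.
        $uri->query_form( \%params );
        return 1;
    }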

Then all the links that swish uses will be without the session id.
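If you want to check the rewriting outside the spider, here is a small standalone sketch. The strip_param() helper and the sample URLs are made up for illustration and are not part of swish-e; it assumes the URI module is installed.

```perl
use strict;
use warnings;
use URI;

# strip_param() is a hypothetical helper for experimenting; it is not part
# of swish-e or spider.pl.
sub strip_param {
    my ($uri, $name) = @_;
    my %params = $uri->query_form;   # note: repeated keys collapse to one
    delete $params{$name};
    # A hash reference (rather than a flattened list) is used so that an
    # empty hash still clears the query string.
    $uri->query_form( \%params );
    return $uri;
}

for my $url (
    'http://foo.com/page.php?PHPSESSID=123&foo=otherstuff',
    'http://foo.com/page.php?PHPSESSID=123',
    'http://foo.com/page.php',
) {
    print strip_param( URI->new($url), 'PHPSESSID' ), "\n";
}
```

The three cases cover a session id alongside other parameters, a session id on its own, and no query string at all.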


If you need to keep the session id, but just want to make sure you
only use one session id, then:


    my $session_id = '';  # initialize the session.

    sub test_url {
        my $uri = shift;
        my %params = $uri->query_form;

        # return true if there's no session id
        my $session = $params{PHPSESSID} || return 1;

        # Save session id if first time
        $session_id = $session unless $session_id;

        # Return true if the session matches, false otherwise
        return $session_id eq $session;
    }

You should figure out why your session id is changing when spidering.
That sounds like there's something broken in your site's setup.

To understand the URI object you can try things at the command line:

$ perl -MURI -lwe 'print URI->new("http://foo.com:33/path/to/stuff?PHPSESSID=123")->query'
PHPSESSID=123

$ perl -MURI -lwe 'print URI->new("http://foo.com:33/path/to/stuff?PHPSESSID=123")->host'
foo.com

$ perl -MData::Dumper -MURI -lwe 'print Data::Dumper::Dumper{ URI->new("http://foo.com:33/path/to/stuff?PHPSESSID=123&foo=otherstuff")->query_form}'
$VAR1 = {
          'PHPSESSID' => '123',
          'foo' => 'otherstuff'
        };





-- 
Bill Moseley
moseley@hank.org

Received on Wed Jan 4 09:28:06 2006