Re: Indexing (test_url)

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Jul 29 2005 - 15:59:44 GMT
On Fri, Jul 29, 2005 at 08:14:12AM -0700, Richard Vaillancourt wrote:
> For the following Web page:
> http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659
> 
> If all the users are authenticate for this page, we will find in the Web cache of Jahia several Web pages:
> 
> http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659;jsessionid=F1DDB0B8E010D86DA452AA25B85749C5
> http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659;jsessionid=F1DDB0B8E010D86DA452AA25B85749C4
> http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659;jsessionid=F1DDB0B8E010D86DA452AA25B85749C3
> etc.
> 
> If I put a filter in Swish to remove the pages containing the string ";jsessionid", I then remove the indexation of the http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659 page.
> 
> How to remove the indexing of all pages that contain the
> ";jsessionid" string without removing the single page without parameters (which is never call alone without parameters).

Is the question how to remove the jsessionid=.* part of the url that
is stored in the swish-e index?

That is, the session id is added when indexing but you do not want that
session id stored in the index?

Then you could use ReplaceRules or modify the URI object in a
filter_content() callback function in the spider config file.

If you just want to avoid indexing any page that has a jsessionid
then use test_url():


> 
> One solution would be to modify the URI variable in the subroutine
> test_url in the Perl script "SwishSpiderConfigSit.pl" by adding the following line:
> 
> sub test_url {
>     my ( $uri, $server ) = @_;
>     # return 1;  # Ok to index/spider
>     # return 0;  # No, don't index or spider;
> 
>     # ignore any common image files
>   
>     # ***************************************************
>     # My new line
>     $uri->path =~ s/\;jses.+//;
>     # ***************************************************
> 
>     return $uri->path !~ /\.(gif|jpg|jpeg|png)?$/;
> 
> }
> 
> The problem is that the URI variable is in read only.  Does somebody have ideas or solutions to help me?


What do you mean it's read only?

First you might want to look over "perldoc URI".
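
On the "read only" part: the URI accessors return plain values, so applying s/// directly to $uri->path has nothing to assign back to. Calling the same method with an argument sets that component instead. A minimal sketch, assuming the session id always shows up as a ";jsessionid=..." path parameter like in the URLs quoted above (the helper name strip_jsessionid is made up for illustration and is not part of the spider):

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: drop a ";jsessionid=..." path parameter in place.
sub strip_jsessionid {
    my ($uri) = @_;
    my $path = $uri->path;                # copy the current path
    $path =~ s{;jsessionid=[^/]*}{}i;     # edit the copy
    $uri->path($path);                    # write it back through the setter
    return $uri;
}

my $u = URI->new("http://server/pid/2659;jsessionid=ABC123");
print strip_jsessionid($u), "\n";
```

Something along these lines could be called from whichever spider callback is meant to rewrite the URL before it is stored.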

moseley(at)not-real.laptop:~$ perl -MURI -le '$u=URI->new("http://server/path?p1=one&p2=two"); print $u->query'
p1=one&p2=two

So you could just say in test_url()

    return if $uri->query =~ /jsessionid/;
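
One caveat: in the URLs quoted at the top, the jsessionid follows a ";" in the path rather than a "?", so $uri->query would be undefined for them and the path is the part to test. A rough sketch of a test_url() that covers both placements, assuming the callback signature from the quoted config; the image check is copied from there unchanged:

```perl
use strict;
use warnings;
use URI;

sub test_url {
    my ( $uri, $server ) = @_;

    # Session id as a path parameter, e.g. /pid/2659;jsessionid=F1DD...
    return 0 if $uri->path =~ /;jsessionid=/i;

    # Session id as a query parameter; query is undef when there is no "?"
    my $query = $uri->query;
    return 0 if defined $query && $query =~ /jsessionid/i;

    # Image check as in the quoted config
    return $uri->path !~ /\.(gif|jpg|jpeg|png)?$/;
}

# Try it on a few made-up URLs
for my $url (
    "http://server/pid/2659",
    "http://server/pid/2659;jsessionid=ABC123",
    "http://server/pid/2659?jsessionid=ABC123",
) {
    my $ok = test_url( URI->new($url), undef );
    print( ( $ok ? "index" : "skip " ), "  $url\n" );
}
```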

You can get a bit more specific using the $uri->query_form function.

moseley(at)not-real.laptop:~$ perl -MData::Dumper -MURI -le '$u=URI->new("http://server/path?p1=one&p2=two"); my %hash = $u->query_form; print Data::Dumper::Dumper(\%hash);'
$VAR1 = {
          'p2' => 'two',
          'p1' => 'one'
        };

(although if two parameters have the same name, only one of them
survives in the hash).
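
If repeated parameters matter, query_form also works as a plain key/value list, which keeps every pair. A small sketch with a made-up URL that collects the values per name:

```perl
use strict;
use warnings;
use URI;

my $u = URI->new("http://server/path?p1=one&p1=two&p2=three");

# query_form returns a flat key/value list; walking it in pairs keeps
# every value, whereas assigning it to a hash keeps only the last one.
my @pairs = $u->query_form;
my %values;
while ( my ( $key, $value ) = splice( @pairs, 0, 2 ) ) {
    push @{ $values{$key} }, $value;
}

print "$_ = @{ $values{$_} }\n" for sort keys %values;
```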


Does that make sense?

-- 
Bill Moseley
moseley@hank.org

Received on Fri Jul 29 08:59:45 2005