Re: Indexing (test_url)

From: Richard Vaillancourt <Richard.Vaillancourt(at)not-real.sit.ulaval.ca>
Date: Tue Aug 09 2005 - 18:00:57 GMT
I used ReplaceRules parameter in the config file and got entries like
this in the log while indexing, which seemed like we had solved our
problem:

Original String:
'http://www.sit.ulaval.ca/sgc/nous_joindre/cache/offonce;jsessionid=ECB59031040C6AD9AA3B49FFD2EDFC5E'
replace
http://www.sit.ulaval.ca/sgc/nous_joindre/cache/offonce;jsessionid=ECB59031040C6AD9AA3B49FFD2EDFC5E =~ m[\;jses.+][]: Matched
  Result String:
'http://www.sit.ulaval.ca/sgc/nous_joindre/cache/offonce'

We didn't verify how pages appear in the index (i.e. indexed only once
or not), but it doesn't matter much, as we're still fetching an enormous
number of pages from our server. Here are some commands I ran in the
shell that show this:

# grep "^>> +Fetched [0-9] " sitJahia20050808.log | cut -d' ' -f6 | perl -pe 's/\;jsessionid.+//g' | wc -l
   9293
# grep "^>> +Fetched [0-9] " sitJahia20050808.log | cut -d' ' -f6 | perl -pe 's/\;jsessionid.+//g' | sort -u | wc -l
   1272
#

Of the 9293 pages fetched, only 1272 are unique; the rest are pages
with a ";jsessionid=" suffix that we are still fetching more than
once.
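
The two counts above can also be reproduced in a single Perl pass. This is only a minimal sketch, assuming the fetched URLs have already been extracted from the log; the three sample URLs are the ones quoted later in this thread, and count_fetched is a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (total, unique) counts for a list of URLs
# after stripping a trailing ";jsessionid..." path parameter.
sub count_fetched {
    my @urls = @_;
    my %seen;
    for my $url (@urls) {
        ( my $clean = $url ) =~ s/;jsessionid.+//;   # same substitution as the pipeline
        $seen{$clean}++;
    }
    return ( scalar @urls, scalar keys %seen );
}

my @urls = (
    'http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659;jsessionid=F1DDB0B8E010D86DA452AA25B85749C5',
    'http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659;jsessionid=F1DDB0B8E010D86DA452AA25B85749C4',
    'http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659;jsessionid=F1DDB0B8E010D86DA452AA25B85749C3',
);
my ( $total, $unique ) = count_fetched(@urls);
print "$total fetched, $unique unique\n";   # prints "3 fetched, 1 unique"
```

The gap between the two numbers is the number of fetches that differ only by session id.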

We also tried using a filter_content() callback function, without
success.

While it seems that we can filter before indexing, it would be handier
to filter before fetching. Any ideas?

Thanks.

Richard Vaillancourt
SIT, Division des systèmes
Pavillon Casault, Université Laval, Ste-Foy, Canada, G1K 7P4
Richard.Vaillancourt@sit.ulaval.ca
Tel: 418-656-2131 ext. 6280,  Fax: 418-656-7305
www: http://www.sit.ulaval.ca/pp/rva/rva.html


-----Original Message-----
From: swish-e@sunsite3.berkeley.edu [mailto:swish-e@sunsite3.berkeley.edu] On behalf of Bill Moseley
Sent: July 29, 2005 11:59
To: Multiple recipients of list
Subject: [SWISH-E] Re: Indexing (test_url)

On Fri, Jul 29, 2005 at 08:14:12AM -0700, Richard Vaillancourt wrote:
> For the following Web page:
> http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659
> 
> If all the users are authenticated for this page, we will find several Web pages in the Jahia Web cache:
> 
> http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659;jsessionid=F1DDB0B8E010D86DA452AA25B85749C5
> http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659;jsessionid=F1DDB0B8E010D86DA452AA25B85749C4
> http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659;jsessionid=F1DDB0B8E010D86DA452AA25B85749C3
> etc.
> 
> If I put a filter in Swish to remove the pages containing the string ";jsessionid", I also lose the indexing of the http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659 page.
> 
> How can I avoid indexing every page that contains the
> ";jsessionid" string without losing the single page without parameters (which is never called on its own without parameters)?

Is the question how to remove the jsessionid=.* part of the url that
is stored in the swish-e index?

That is, the session is added when indexing but you do not want that
session stored in the index?

Then you could use ReplaceRules or modify the URI object in a
filter_content() callback function in the spider config file.

If you just want to avoid indexing any page that has a jsessionid
then use test_url():


> 
> One solution would be to modify the URI variable in the subroutine
> test_url in the Perl script "SwishSpiderConfigSit.pl" by adding the following line:
> 
> sub test_url {
>     my ( $uri, $server ) = @_;
>     # return 1;  # Ok to index/spider
>     # return 0;  # No, don't index or spider;
> 
>     # ignore any common image files
>   
>     # ***************************************************
>     # My new line
>     $uri->path =~ s/\;jses.+//;
>     # ***************************************************
> 
>     return $uri->path !~ /\.(gif|jpg|jpeg|png)?$/;
> 
> }
> 
> The problem is that the URI variable is read-only.  Does anybody have ideas or solutions to help me?


What do you mean it's read only?

First you might want to look over "perldoc URI".
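
For what it's worth, the "read only" symptom probably comes from the line `$uri->path =~ s/\;jses.+//;` in the quoted test_url: `path` there is an ordinary method call, not an lvalue, so Perl will not apply s/// to it. "perldoc URI" documents `path` as both a getter and a setter, so the value can be copied, edited, and written back. A minimal sketch, assuming the spider hands test_url a URI object as in the quoted code; strip_jsessionid is a hypothetical helper name, and whether the spider then treats the rewritten URLs as already visited is not something this thread confirms:

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: remove a ";jses..." path parameter in place.
sub strip_jsessionid {
    my ($uri) = @_;
    my $path = $uri->path;    # getter: returns a plain string copy
    $path =~ s/;jses.+//;     # edit the copy, same pattern as the quoted line
    $uri->path($path);        # setter: write the cleaned path back
    return $uri;
}

my $uri = URI->new(
    'http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659;jsessionid=F1DDB0B8E010D86DA452AA25B85749C5'
);
strip_jsessionid($uri);
print "$uri\n";   # prints http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659
```

Inside test_url() this would be a call to strip_jsessionid($uri) ahead of the final return.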

moseley(at)not-real.laptop:~$ perl -MURI -le '$u=URI->new("http://server/path?p1=one&p2=two"); print $u->query'
p1=one&p2=two

So you could just say in test_url()

    return if $uri->query =~ /jsessionid/;

You can get a bit more specific using the $uri->query_form function.

moseley(at)not-real.laptop:~$ perl -MData::Dumper -MURI -le '$u=URI->new("http://server/path?p1=one&p2=two"); my %hash = $u->query_form; print Data::Dumper::Dumper(\%hash);'
$VAR1 = {
          'p2' => 'two',
          'p1' => 'one'
        };

(although if you have two parameters with the same name one will be
lost).
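
One caveat for the URLs quoted at the top of this message: there the session id follows a ";" rather than a "?", so URI reports it as part of the path and query comes back undefined. A minimal sketch of a check that looks at both places, assuming a URI object as above (has_jsessionid is a hypothetical helper name):

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: true if the URL carries a jsessionid either as a
# ";" path parameter or as a "?" query parameter.
sub has_jsessionid {
    my ($uri) = @_;
    return 1 if $uri->path =~ /;jsessionid=/;
    my $query = $uri->query;    # undef when the URL has no "?" part
    return 1 if defined $query && $query =~ /(?:^|[&;])jsessionid=/;
    return 0;
}

my $u = URI->new(
    'http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659;jsessionid=F1DDB0B8E010D86DA452AA25B85749C5'
);
print defined $u->query ? "query set\n" : "no query\n";   # prints "no query"
print has_jsessionid($u) ? "skip\n" : "index\n";          # prints "skip"
```

In test_url(), `return if has_jsessionid($uri);` would then skip those URLs, with the side effect described in the original question: a page that is only ever linked with a session id would not be indexed at all.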


Does that make sense?

-- 
Bill Moseley
moseley@hank.org

Received on Tue Aug 9 11:01:20 2005