We currently use the Swish-e indexing and search engine, version 2.2.3, on Red Hat Linux to index the pages of our CMS, Jahia 4.05 (www.jahia.org).
When a user is authenticated on a Web page, the same page then shows up under a different URL, because Jahia appends a jsessionid parameter to the end of the URL. Here is an example.
For the following Web page:
http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659
If users are authenticated on this page, the Jahia Web cache ends up holding several Web pages:
http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659;jsessionid=F1DDB0B8E010D86DA452AA25B85749C5
http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659;jsessionid=F1DDB0B8E010D86DA452AA25B85749C4
http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659;jsessionid=F1DDB0B8E010D86DA452AA25B85749C3
etc.
If I put a filter in Swish-e to exclude the pages containing the string ";jsessionid", I also lose the indexing of the http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659 page.
How can I stop all the URLs that contain the ";jsessionid" string from being indexed without losing the single page without parameters (which is never requested on its own, without the parameter)?
One solution would be to modify the URI in the test_url subroutine of the Perl script "SwishSpiderConfigSit.pl" by adding the following line:
    sub test_url {
        my ( $uri, $server ) = @_;
        # return 1; # Ok to index/spider
        # return 0; # No, don't index or spider;

        # ***************************************************
        # My new line
        $uri->path =~ s/\;jses.+//;
        # ***************************************************

        # ignore any common image files
        return $uri->path !~ /\.(gif|jpg|jpeg|png)?$/;
    }
The problem is that the value returned by $uri->path appears to be read-only, so the substitution fails. Does anybody have ideas or solutions that could help me?
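One direction that might work, given here only as a minimal sketch: it assumes that $uri is a standard Perl URI object, whose path method also accepts a new value when called as $uri->path($new_path), and that the spider keeps using the same object after test_url returns. The helper strip_jsessionid is a hypothetical name and is not part of Swish-e.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Swish-e): remove a
# ";jsessionid=..." path parameter from a path string.
sub strip_jsessionid {
    my ($path) = @_;
    $path =~ s/;jsessionid=[^\/?#]*//i;
    return $path;
}

sub test_url {
    my ( $uri, $server ) = @_;

    # $uri->path returns a copy of the path, so edit the copy and
    # write it back through the setter form of the same method.
    my $path = strip_jsessionid( $uri->path );
    $uri->path($path) if $path ne $uri->path;

    # Skip common image files; spider and index everything else.
    return $path !~ /\.(gif|jpg|jpeg|png)$/;
}
```

With this, a link to /sgc/etudiants/cache/offonce/pid/2659;jsessionid=... would be rewritten to /sgc/etudiants/cache/offonce/pid/2659 before the accept/reject decision is made. Whether the spider then fetches and records the rewritten URL only once is something I have not verified.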
Thanks.
-------------------
French version (translated):
We use the Swish-e search engine, version 2.2.3, on Red Hat Linux to index the pages of our CMS, Jahia 4.05 (www.jahia.org).
During indexing, however, many URLs end with ;jsessionid=..., which needs to be removed from the URL while indexing.
Example:
http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659;jsessionid=F1DDB0B8E010D86DA452AA25B85749C5
http://www.sit.ulaval.ca/sgc/etudiants/cache/offonce/pid/2659 <-- OK
We tried to modify the test_url routine of the Perl program SwishSpiderConfigSit.pl, without success:
    sub test_url {
        my ( $uri, $server ) = @_;
        # return 1; # Ok to index/spider
        # return 0; # No, don't index or spider;

        # Added the following line to remove the ;jsessionid=... (without success)
        $uri->path =~ s/\;jses.+//;

        # ignore any common image files
        return $uri->path !~ /\.(gif|jpg|jpeg|png)?$/;
    }
How can we index these links with the ;jsessionid=... removed from the URLs?
Thanks in advance,
Richard Vaillancourt
SIT, Division des systèmes
Pavillon Casault, Université Laval, Ste-Foy, Canada, G1K 7P4
Richard.Vaillancourt@sit.ulaval.ca
Tel: 418-656-2131 ext. 6280, Fax: 418-656-7305
www: http://www.sit.ulaval.ca/pp/rva/rva.html
Received on Fri Jul 29 08:19:42 2005