On Tue, Aug 09, 2005 at 11:17:56AM -0700, Bill Moseley wrote:
> Another option is to just keep your own cache of URLs:
>
> # Avoid duplicate URLs
> my $new_url = url with jsessionid removed;
> return !$seen_url_before{ $new_url }++;
In private email you sent:
sub test_url {
my ( $uri, $server ) = @_;
# return 1; # Ok to index/spider
# return 0; # No, don't index or spider;
# always use the same jsessionid, regardless of what the server
# tells us.
my $new_url = url with jsessionid removed;
return !$seen_url_before{ $new_url }++;
# ignore any common image files
# $uri->path =~ s/\;jses.+//;
return $uri->path !~ /\.(gif|jpg|jpeg|png)?$/;
}
Sorry, that was just a suggestion, not actual Perl code.
Also, that first return !$seen... means the rest of your code is
skipped.
my %seen_uri;
[later]
sub test_url {
    my ( $uri_orig ) = @_;
    my $uri = $uri_orig->clone;  # don't want to change the original

    # Warning -- collapses repeated parameters: storing them in a hash
    # keeps only one value per parameter name
    my %params = $uri->query_form;  # grab parameters and store as a hash
    delete $params{jsessionid};     # delete jsessionid
    if ( %params ) {
        $uri->query_form( %params );  # update the parameter list
    } else {
        $uri->query( undef );         # or just erase the params
    }
    return if $seen_uri{ $uri }++;  # return (false) if seen this URL before
[...]
BTW -- you posted this:
http://www.sit.ulaval.ca/sgc/nous_joindre/cache/offonce;jsessionid=ECB59031040C6AD9AA3B49FFD2EDFC5E
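That's not a proper URL. Should be a '?', not a ';'. The above WILL
NOT work if you use those broken URLs, because query_form() only looks
at the query string (the part after the '?'), and here the jsessionid
is part of the path.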
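
If the server really does hand out the ';jsessionid=' form, one way around
it is to normalize the URL as a plain string before checking the cache.
This is only a rough sketch, not tested against spider.pl: the names
strip_jsessionid, is_new_url and %seen_url are made up for illustration,
and it assumes the session id is always spelled "jsessionid" and appears
at most once per URL.

```perl
use strict;
use warnings;

my %seen_url;    # hypothetical cache of normalized URLs already visited

# Hypothetical helper: return the URL string with the jsessionid removed,
# whether it shows up as a ";jsessionid=..." path parameter or as a
# "jsessionid=..." query parameter.
sub strip_jsessionid {
    my ($url) = @_;

    # Path-parameter form: ";jsessionid=VALUE", up to the next ; / ? or #
    $url =~ s/;jsessionid=[^;\/?#]*//ig;

    # Query form: "?jsessionid=VALUE" or "&jsessionid=VALUE"
    $url =~ s/([?&])jsessionid=[^&#]*&?/$1/ig;

    # Remove a dangling "?" or "&" left at the end of the query
    $url =~ s/[?&](?=#|$)//;

    return $url;
}

# True the first time a normalized URL is seen, false on every later call.
sub is_new_url {
    my ($url) = @_;
    return !$seen_url{ strip_jsessionid($url) }++;
}
```

Inside test_url you could then do something like
"return unless is_new_url( $uri->as_string );" before the other checks,
so that the image-file test further down still gets a chance to run.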
--
Bill Moseley
moseley@hank.org
Received on Tue Aug 9 12:51:48 2005