Re: MD5 not filtering out 'variant' querystrings

From: Tim Hartley <tim.hartley(at)not-real.planetpdf.com>
Date: Thu Dec 16 2004 - 06:28:07 GMT

Hi Bill,

>If you are seeing duplicate documents then it should mean that the content is actually different.

Yup, I know, which is why I was getting stuck. It's definitely just one page with many links to it, and the links carry various tracking tags for the marketing dept's traffic reports.

Anyway, I got around it by inserting the following code into my test_response sub:

# Only need to check the details.asp page
if ($uri->path =~ /details\.asp$/)
	{
	# query_param() comes from URI::QueryParam, which must be loaded.
	# In scalar context it returns the number of distinct parameter names.
	my $param_count = $uri->query_param;
	# If there's more than one parameter, consider it a duplicate and skip it
	if ($param_count > 1) { return 0; }
	}
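
For anyone who wants to try the check outside the spider, here is a minimal standalone sketch of the same idea: treat any details.asp URL with more than one distinct query parameter as a tracking variant. The helper name is_tracking_variant is made up for illustration, the URLs are the placeholder ones from this thread, and it assumes the URI and URI::QueryParam modules are installed.

```perl
use strict;
use warnings;
use URI;
use URI::QueryParam;

# Hypothetical helper (not part of spider.pl): true when a details.asp
# URL carries more than one distinct query parameter, i.e. something
# besides the product id.
sub is_tracking_variant {
    my ($uri) = @_;
    return 0 unless $uri->path =~ /details\.asp$/;
    # Scalar context: number of distinct parameter names
    my $param_count = $uri->query_param;
    return $param_count > 1 ? 1 : 0;
}

# Placeholder URLs for illustration only
for my $url ('http://blah.com/details.asp?prodid=615',
             'http://blah.com/details.asp?prodid=615&fa') {
    my $verdict = is_tracking_variant(URI->new($url)) ? 'skip' : 'index';
    print "$url => $verdict\n";
}
```

Inside a test_response callback the same helper could be used as "return 0 if is_tracking_variant($uri);". Note that this skips every multi-parameter link, so the page only gets indexed if the plain single-parameter URL is also reachable by the spider.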

Regards,

Tim



-----Original Message-----
From: swish-e@sunsite3.berkeley.edu
[mailto:swish-e@sunsite3.berkeley.edu]On Behalf Of Bill Moseley
Sent: Friday, 3 December 2004 1:19 AM
To: Multiple recipients of list
Subject: [SWISH-E] Re: MD5 not filtering out 'variant' querystrings


On Wed, Dec 01, 2004 at 07:26:36PM -0800, Tim Hartley wrote:
> One of the searches that I've set up on my site is returning
> multiple hits of the same page. The hits show that the only
> differences are slight variances in the querystrings.
> http://blah.com/details.asp?prodid=615,
> http://blah.com/details.asp?prodid=615&fa, etc etc. I'm using MD5 in
> my spider configuration but it doesn't seem to clear these
> duplications up, and in some cases I get up to six varied
> querystrings for the same page. Any clues? Details follow.

The MD5 hash is generated from the content -- well, it's generated
from the content unless there's a "Content-MD5" header returned from
the web server.

If you are seeing duplicate documents then it should mean that the
content is actually different.

Try fetching (with wget or GET) your documents that should be the same
and run md5sum on them (or just diff).
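
The comparison suggested above can also be scripted. The following is only a sketch: the two page bodies are invented stand-ins for what wget or GET would return, the helper name same_digest is hypothetical, and it is not the code spider.pl itself uses. It relies only on the core Digest::MD5 module.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical page bodies standing in for two fetched variants of the
# same page; in practice these would come from wget, GET, or LWP.
my %body_for = (
    'details.asp?prodid=615'    => "<html><body>Product 615</body></html>\n",
    'details.asp?prodid=615&fa' => "<html><body>Product 615</body><!-- fa --></html>\n",
);

# Return true when two bodies hash to the same MD5 digest.
sub same_digest {
    my ($first, $second) = @_;
    return md5_hex($first) eq md5_hex($second);
}

# Print one digest per URL, in the same layout md5sum uses
for my $url (sort keys %body_for) {
    printf "%s  %s\n", md5_hex($body_for{$url}), $url;
}
```

If the digests differ, even a single byte in the returned content differs (for example, a page that echoes its own querystring), and the MD5 check will not treat the documents as duplicates.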

-- 
Bill Moseley
moseley@hank.org

Received on Wed Dec 15 22:28:23 2004