Re: MD5 not filtering out 'variant' querystrings

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Dec 02 2004 - 14:19:24 GMT
On Wed, Dec 01, 2004 at 07:26:36PM -0800, Tim Hartley wrote:
> One of the searches that I've set up on my site is returning
> multiple hits of the same page. The hits show that the only
> differences are slight variances in the querystrings.
> http://blah.com?details.asp?prodid=615,
> http://blah.comdetails.asp?prodid=615&fa, etc etc. I'm using MD5 in
> my spider configuration but it doesn't seem to clear these
> duplications up, and in some cases I get up to six varied
> querystrings for the same page. Any clues? Details follow.

The MD5 hash is generated from the content -- well, it's generated
from the content unless there's a "Content-MD5" header returned from
the web server.

If you are still seeing what look like duplicate documents, it most
likely means their content actually differs, so the hashes differ too.

Try fetching (with wget or GET) your documents that should be the same
and run md5sum on them (or just diff).
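
If you'd rather script that check, here is a rough Python sketch of the
same idea. It is not what the spider does internally; it just hashes the
raw response bodies and groups URLs by digest. The names fetch and
group_by_md5 are made up for this example, and it ignores any
Content-MD5 header the server might send.

```python
import hashlib
import sys
import urllib.request
from collections import defaultdict


def fetch(url, timeout=10):
    """Return the raw response body for url (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def group_by_md5(pages):
    """Map each MD5 hex digest to the URLs whose body produced it.

    pages is a dict of {url: body_bytes}.
    """
    groups = defaultdict(list)
    for url, body in pages.items():
        groups[hashlib.md5(body).hexdigest()].append(url)
    return dict(groups)


if __name__ == "__main__":
    # Pass the URLs you expect to be identical on the command line.
    pages = {url: fetch(url) for url in sys.argv[1:]}
    for digest, urls in group_by_md5(pages).items():
        print(digest, *urls)
```

URLs that land under different digests are returning different bytes,
even if the pages look the same in a browser. Running diff on the saved
copies will then show what changed (a timestamp, a session id, an echoed
query string, and so on).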

-- 
Bill Moseley
moseley@hank.org

Received on Thu Dec 2 06:19:25 2004