On Wed, Dec 01, 2004 at 07:26:36PM -0800, Tim Hartley wrote:
> One of the searches that I've set up on my site is returning
> multiple hits of the same page. The hits show that the only
> differences are slight variances in the querystrings.
> http://blah.com?details.asp?prodid=615,
> http://blah.comdetails.asp?prodid=615&fa, etc etc. I'm using MD5 in
> my spider configuration but it doesn't seem to clear these
> duplications up, and in some cases I get up to six varied
> querystrings for the same page. Any clues? Details follow.
The MD5 hash is generated from the content -- well, it's generated
from the content unless there's a "Content-MD5" header returned from
the web server.
If you are still seeing duplicate documents with MD5 checking enabled,
it most likely means the content returned for those URLs really is
different, even if only by a few bytes.
Try fetching the documents that should be the same (with wget or GET)
and run md5sum on them, or just diff them.
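
If it's easier to script, here is a rough Python sketch of the same
check. It is only an illustration, not what the spider does internally:
the function names are made up, it hashes the raw response body, and it
ignores any "Content-MD5" header. Pass it the URLs that you expect to
be identical.

```python
import hashlib
import sys
import urllib.request
from collections import defaultdict


def fetch(url, timeout=10):
    """Return the raw response body for url (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def md5_hex(body):
    """Hex MD5 digest of a bytes body, comparable to md5sum output."""
    return hashlib.md5(body).hexdigest()


def group_by_digest(bodies):
    """Map each digest to the list of URLs whose body hashed to it."""
    groups = defaultdict(list)
    for url, body in bodies.items():
        groups[md5_hex(body)].append(url)
    return dict(groups)


if __name__ == "__main__":
    # URLs come from the command line; nothing is fetched if none are given.
    bodies = {url: fetch(url) for url in sys.argv[1:]}
    for digest, urls in group_by_digest(bodies).items():
        print(digest, *urls)
```

URLs printed on the same line returned byte-identical content. If each
URL ends up on its own line, the pages differ somewhere, and a diff of
the saved output should show where.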
--
Bill Moseley
moseley@hank.org
Received on Thu Dec 2 06:19:25 2004