Re: Duplicate files

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Apr 26 2002 - 16:27:35 GMT
At 09:04 AM 04/26/02 -0700, GUEGAN Ronald wrote:
>Is there a way to detect that an HTML file has already been indexed?
>
>We are indexing websites where a file can be accessed in various ways:
>  - http://www.mysite.com/app1/page.asp?param=1&other=0
>  - http://www.mysite.com/app1/page.asp?param=1
>In the given example, both URLs could point to the same page.

If you are using the 2.1-dev version (soon to be a prerelease) with
-S prog and spider.pl, then yes, you can.  That spider has an MD5 option
to fingerprint each page, so that should catch duplicates.
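
To illustrate the idea (this is only a rough sketch in Python, not the
actual spider.pl code, and the names fingerprint and DuplicateFilter are
made up for the example): hash the fetched page body with MD5 and
remember which URL first produced each digest.  A later URL whose body
hashes to the same digest is treated as a duplicate.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return the MD5 hex digest of a page body."""
    return hashlib.md5(content).hexdigest()


class DuplicateFilter:
    """Remembers page fingerprints seen so far (hypothetical helper)."""

    def __init__(self):
        # Maps digest -> the first URL that produced that content.
        self._seen = {}

    def check(self, url, content):
        """Return the URL that first had this content, or None if new."""
        digest = fingerprint(content)
        first = self._seen.get(digest)
        if first is None:
            self._seen[digest] = url
        return first


if __name__ == "__main__":
    dupes = DuplicateFilter()
    body = b"<html><body>Same page</body></html>"
    print(dupes.check("http://www.mysite.com/app1/page.asp?param=1&other=0", body))
    print(dupes.check("http://www.mysite.com/app1/page.asp?param=1", body))
```

Note that this only catches pages whose content is byte-for-byte
identical; if the server embeds a timestamp or session ID in the page,
the digests will differ.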

We discussed this just a few days ago, so you might check the list
archives, too.


-- 
Bill Moseley
mailto:moseley@hank.org
Received on Fri Apr 26 16:27:36 2002