On Mon, 2 Dec 2002 tom12@bluemail.ch wrote:
> Hello
>
> Is there a way in swish-e to eliminate repeated URLs when indexing
> or searching? For example, searching for the keyword Baghdad returns both
> www.cnn.com and http://www.cnn.com/2002/WORLD/meast/12/02/sproject.irq.inspectors/index.html.
> It would be nice if I received only the last one. How can I
> do that? Unfortunately, I wasn't able to find any information in the documentation.
Why do you say those are duplicate URLs?
During spidering (with -S prog and spider.pl) you can use MD5 checksums
to avoid duplicate content. You can also simply reject some URLs from
indexing in the spider config file.
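
To illustrate the two ideas, here is a rough Python sketch. It is not
spider.pl's actual code (spider.pl is Perl and has its own config options);
the function names dedupe_by_md5 and url_allowed, and the example.com URLs,
are made up for this example. It assumes each fetched page is available as
a (url, body) pair.

```python
import hashlib


def dedupe_by_md5(pages):
    """Return the URLs to index, keeping only the first URL seen
    for each distinct page body (compared by MD5 digest).

    `pages` is an iterable of (url, body_bytes) pairs. Hypothetical
    helper for illustration only.
    """
    seen_digests = set()
    kept_urls = []
    for url, body in pages:
        digest = hashlib.md5(body).hexdigest()
        if digest in seen_digests:
            # Same content was already fetched under another URL.
            continue
        seen_digests.add(digest)
        kept_urls.append(url)
    return kept_urls


def url_allowed(url, rejected_prefixes):
    """Return False for URLs that start with any rejected prefix.

    Stands in for a URL filter in a spider config. Hypothetical helper.
    """
    return not any(url.startswith(prefix) for prefix in rejected_prefixes)


if __name__ == "__main__":
    pages = [
        ("http://example.com/", b"<html>front page</html>"),
        ("http://example.com/index.html", b"<html>front page</html>"),
        ("http://example.com/story.html", b"<html>story</html>"),
    ]
    print(dedupe_by_md5(pages))
    print(url_allowed("http://example.com/ads/banner", ["http://example.com/ads/"]))
```

Note that a checksum only catches pages whose content is byte-for-byte
identical. Two different pages on the same site will both be kept unless
one of them is rejected by URL.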
--
Bill Moseley moseley@hank.org
Received on Mon Dec 2 14:06:25 2002