At 11:42 AM 04/22/02 -0700, Colin Kuskie wrote:
>I found that I was getting "duplicate" results when indexing:
>1000 http://www.sunsetpres.org/Men/ "Sunset Presbyterian Men's Ministry
>1000 http://www.sunsetpres.org/Men/index.html "Sunset Presbyterian Men's
>Ministry Page" 29670
Two different URLs.
>Based on reading the docs, I expected it to merge the results for
>the two URLs, since they say:
> ReplaceRules allows you to make changes to file path
> names before they're indexed. These changed file
> names or URLs will be returned in search results.
Yes, perhaps not the best wording.
You can change the path name stored in the index with ReplaceRules,
but it doesn't affect what is sent to swish for indexing. The rewrite
happens before indexing, not before spidering a URL.
In other words, think of it as a pipe:
spider | swish
spider is just passing files to swish, and swish can't tell spider anything.
If you are using -S http, you might be able to edit the swishspider Perl
program and append "index.html" to any links that end in a slash. But that
won't fix links that omit the trailing slash (and generate a redirect).
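
To make the link rewrite concrete, here is a minimal sketch in Python (not
the Perl that swishspider is written in, and not taken from its source).
The function name and the choice of "index.html" as the default document
are assumptions; your server may serve a different default.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the server's default document is index.html.
DEFAULT_DOC = "index.html"


def add_default_document(url, default_doc=DEFAULT_DOC):
    """Append the default document to URLs whose path ends in a slash.

    URLs that already name a file, or that lack the trailing slash,
    are returned unchanged.
    """
    parts = urlsplit(url)
    path = parts.path or "/"  # treat a bare host as the root directory
    if path.endswith("/"):
        path += default_doc
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )
```

With this, http://www.sunsetpres.org/Men/ and
http://www.sunsetpres.org/Men/index.html map to the same string, while
http://www.sunsetpres.org/Men (no trailing slash) is left alone, which is
the case the redirect caveat above refers to.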
You could even use MD5 checksums of the page content in swishspider to
detect duplicates, but you would need to store the keys on disk, since
swishspider is run as a separate process for every URL.
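
A rough sketch of the MD5 idea, again in Python rather than Perl and only
as an illustration: the function name and the use of a dbm file as the
on-disk key store are assumptions, not anything swishspider provides.

```python
import dbm
import hashlib


def seen_before(content: bytes, db_path: str) -> bool:
    """Return True if identical content was already recorded in db_path.

    The digest store lives on disk, so it survives between separate
    runs of the program (one run per URL).
    """
    digest = hashlib.md5(content).hexdigest().encode("ascii")
    # "c" opens the database read-write, creating it if it is missing.
    with dbm.open(db_path, "c") as db:
        if digest in db:
            return True
        db[digest] = b"1"
        return False
```

A spider would call this once per fetched page and skip the page when it
returns True. Note that this catches byte-identical content only; two
pages that differ by a single character get different digests.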
-S prog with spider.pl is a lot more flexible, and probably faster, too,
since it avoids compiling a Perl program for every URL.
Received on Mon Apr 22 19:02:01 2002