I suppose it depends on what you consider to be 'similar'.
<p>
The cat sat on the mat.
</p>
<p>
The mat sat on the cat.
</p>
From an indexing point of view, you might consider those 99% the same: same
words, different order.
From a semantic/logical point of view, they communicate something totally
different, yes?
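To make that difference concrete, here is a rough sketch in plain Python (not anything swish-e provides). It scores the two chunks twice: once on the set of words alone, and once on adjacent word pairs, which is sensitive to order. The helper names (words, jaccard, shingles) and the crude regex tag stripping are my own assumptions for illustration, not a recommended HTML parser.

```python
import re

def words(html):
    """Crudely strip tags and return lowercase word tokens."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"[a-z0-9']+", text.lower())

def jaccard(a, b):
    """Jaccard similarity of two collections: 1.0 = identical sets, 0.0 = disjoint."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def shingles(tokens, n=2):
    """Overlapping runs of n consecutive tokens, so word order matters."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc_a = "<p>\nThe cat sat on the mat.\n</p>"
doc_b = "<p>\nThe mat sat on the cat.\n</p>"

# Same words, so the word-set score cannot tell the two apart.
word_score = jaccard(words(doc_a), words(doc_b))

# Word pairs differ where cat and mat swap places, so this score drops.
order_score = jaccard(shingles(words(doc_a)), shingles(words(doc_b)))

print(f"word set: {word_score:.2f}  word pairs: {order_score:.2f}")
```

Either number fits the "float between 0 and 1" you asked for; which one counts as "similar" is the judgment call.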
One thing I might try would be to ignore words with a high index frequency,
what we normally consider stopwords. I would play with the IgnoreWords config
setting to try that out. That way you could separate the chaff (so to speak)
from the words that "matter".
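As a rough illustration of the idea, outside of swish-e itself, the sketch below counts how many documents each word appears in, treats anything above a cutoff as a stopword, and compares what is left. The function names, the three sample documents, and the 0.5 cutoff are all made up for the example; in swish-e you would feed the resulting word list to the config rather than compare documents this way.

```python
import re
from collections import Counter

def words(html):
    """Crudely strip tags and return lowercase word tokens."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"[a-z0-9']+", text.lower())

def frequent_words(docs, max_fraction=0.5):
    """Words that occur in more than max_fraction of the documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(words(doc)))  # count each word once per document
    limit = max_fraction * len(docs)
    return {w for w, n in doc_freq.items() if n > limit}

def similarity(a, b, ignore):
    """Jaccard similarity of the two word sets after dropping ignored words."""
    sa = set(words(a)) - ignore
    sb = set(words(b)) - ignore
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# Hypothetical mini-corpus, only for demonstration.
docs = [
    "<p>The cat sat on the mat.</p>",
    "<p>The dog sat on the rug.</p>",
    "<p>The bird sat on the fence.</p>",
]

ignore = frequent_words(docs)
raw = similarity(docs[0], docs[1], set())
filtered = similarity(docs[0], docs[1], ignore)

print(sorted(ignore))
print(f"raw: {raw:.2f}  filtered: {filtered:.2f}")
```

With the common words left in, the first two documents look fairly alike; once they are dropped, nothing that "matters" overlaps. The cutoff is the knob to experiment with.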
mark@workzoo.com wrote on 2/6/05 1:31 AM:
> Hi,
>
> Is there a way to algorithmically calculate the similarity between two
> chunks of html as some sort of index? Perhaps a float value between 0 and 1
> where 1 is exactly the same and 0 is 100% different? I'm trying to remove
> very similar documents from our swish index.
>
> I'd really appreciate any help you can offer because I've been struggling
> with this for some time.
>
> Thanks,
>
> Mark.
--
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Sun Feb 6 06:22:12 2005