I suppose it depends on what you consider to be 'similar'.
The cat sat on the mat.
The mat sat on the cat.
From an indexing point of view, you might consider those 99% the same. Same
words, different order.
From a semantic/logical point of view, they communicate something totally
different.
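To make the two points of view concrete, here is a rough Python sketch. It is not Swish-e code, and the function names (tokenize, word_jaccard, shingle_jaccard) are made up for illustration. It scores the two sentences above with a bag-of-words Jaccard index, which ignores word order, and with a Jaccard index over word n-grams ("shingles"), which does not. Both return a float between 0 and 1.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b|, in the range 0..1."""
    if not a and not b:
        return 1.0  # two empty inputs are treated as identical
    return len(a & b) / len(a | b)


def word_jaccard(text_a, text_b):
    """Order-blind similarity: compares the sets of distinct words."""
    return jaccard(set(tokenize(text_a)), set(tokenize(text_b)))


def shingle_jaccard(text_a, text_b, n=2):
    """Order-sensitive similarity: compares sets of n-word sequences."""
    def shingles(text):
        words = tokenize(text)
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return jaccard(shingles(text_a), shingles(text_b))


if __name__ == "__main__":
    a = "The cat sat on the mat."
    b = "The mat sat on the cat."
    print(word_jaccard(a, b))        # same words, so the maximum score
    print(shingle_jaccard(a, b, 2))  # lower: some word pairs differ
    print(shingle_jaccard(a, b, 3))  # lower still with longer shingles
```

Neither number says anything about meaning. The shingle score only notices that the word order changed, which may or may not be enough for weeding out near-duplicate documents.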
One thing I might try would be to ignore words with a high index frequency,
what we normally consider stopwords. I would play with the IgnoreWords config
setting to try that out. That way you could separate the chaff (so to speak)
from the words that "matter".
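Outside of Swish-e, the same idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the STOP_WORDS list is a tiny made-up sample (a real one would come from your own index frequencies), and content_words and similarity are hypothetical helper names. The sketch strips the HTML tags, drops the stop words, and returns a Jaccard score between 0 and 1.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop-word list for the example.
STOP_WORDS = frozenset({"a", "an", "the", "on", "of", "and", "to", "in"})


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def content_words(html, stop_words=STOP_WORDS):
    """Return the set of distinct non-stop-words in an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    return {w for w in re.findall(r"[a-z0-9]+", text) if w not in stop_words}


def similarity(html_a, html_b, stop_words=STOP_WORDS):
    """Jaccard index of the content words: 1.0 = same set, 0.0 = disjoint."""
    a = content_words(html_a, stop_words)
    b = content_words(html_b, stop_words)
    if not a and not b:
        return 1.0  # nothing left to compare on either side
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc_a = "<p>The cat sat on the mat.</p>"
    doc_b = "<p>A dog sat on the mat.</p>"
    print(similarity(doc_a, doc_b))                          # stop words ignored
    print(similarity(doc_a, doc_b, stop_words=frozenset()))  # stop words kept
```

With the stop words left in, the shared "the" and "on" push the score up even though the documents differ in the word that matters. Where to draw the "too similar" threshold is something you would have to tune against your own documents.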
firstname.lastname@example.org wrote on 2/6/05 1:31 AM:
> Is there a way to algorithmically calculate the similarity between two
> chunks of html as some sort of index? Perhaps a float value between 0 and 1
> where 1 is exactly the same and 0 is 100% different? I'm trying to remove
> very similar documents from our swish index.
> I'd really appreciate any help you can offer because I've been struggling
> with this for some time.
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Sun Feb 6 06:22:12 2005