Calculating a similarity index between HTML files

From: <mark(at)>
Date: Sun Feb 06 2005 - 07:32:31 GMT

Is there a way to algorithmically calculate the similarity between two
chunks of HTML as some sort of index? Perhaps a float value between 0 and 1,
where 1 means exactly the same and 0 means completely different? I'm trying
to remove very similar documents from our swish index.
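
To make the goal concrete, here is one rough sketch of such an index. It is
an illustration under stated assumptions, not a tested solution: it uses
Python's standard difflib and html.parser modules, the name html_similarity
is hypothetical, and it compares only the text words, ignoring tags and
attributes (text inside script or style elements is still counted).

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the words found in text nodes and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.words = []

    def handle_data(self, data):
        # Split on whitespace so layout differences do not affect the score.
        self.words.extend(data.split())


def html_similarity(a, b):
    """Return a float in [0, 1]; 1.0 means the word sequences are identical.

    Hypothetical helper: the score is difflib's ratio, 2*M/T, where M is the
    number of matching words and T the total number of words in both chunks.
    """
    sequences = []
    for chunk in (a, b):
        parser = _TextExtractor()
        parser.feed(chunk)
        parser.close()
        sequences.append(parser.words)
    return SequenceMatcher(None, sequences[0], sequences[1]).ratio()
```

Documents scoring above some cutoff could then be treated as duplicates; the
cutoff itself would have to be chosen by experiment. Comparing every pair of
documents this way is quadratic in the number of documents, so it may only be
practical for a small collection.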

I'd really appreciate any help you can offer because I've been struggling
with this for some time.


Received on Sat Feb 5 23:32:40 2005