
Re: Calculating similarity index between html files

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Tue Feb 08 2005 - 14:04:58 GMT
Mark Maunder wrote on 2/6/05 11:49 AM:

> An interesting feature in swish might be to have a config option to
> remove duplicates while indexing. The implementation might calculate the
> levenshtein distance of each field added to every other field that has a
> set of predefined MetaNames equal. In other words, it only calculates
> the LD for all docs that have the same title and base url, for example.
> Then it only preserves the most recent document of the duplicates.

That's an interesting idea to play with. I'll have to look into it more. Some 
derivative of it might be useful for a ranking scheme.
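For the archives, here's a rough sketch (in Python, not Swish-e code) of what Mark describes: group docs on matching MetaNames, then use edit distance within each group and keep only the most recent of any near-duplicates. The field names (`title`, `base_url`, `mtime`, `body`) and the edit-distance threshold are assumptions for illustration, not anything Swish-e defines.

```python
from itertools import groupby

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def dedupe(docs, threshold=10):
    """Within each (title, base_url) group, keep the newest doc and drop
    any older doc whose body is within `threshold` edits of a survivor."""
    keyfn = lambda d: (d["title"], d["base_url"])
    keep = []
    for _, group in groupby(sorted(docs, key=keyfn), key=keyfn):
        newest_first = sorted(group, key=lambda d: d["mtime"], reverse=True)
        survivors = []
        for doc in newest_first:
            if all(levenshtein(doc["body"], s["body"]) > threshold
                   for s in survivors):
                survivors.append(doc)
        keep.extend(survivors)
    return keep
```

Note the cost: edit distance is O(len(a) * len(b)) per pair, which is exactly why doing this inside the indexer on million-doc collections worries me.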

However, I think it's beyond the bounds of Swish-e's mission to include that 
kind of feature on the indexing side. Swish-e does one thing well: index and 
search files. The more features we add to it, the less likely it is to do 
its main job quickly. Judging from the number of emails on this list about 
folks using Swish-e to index million+ docs, I think it's already being stretched 
beyond its original intentions. I'm waiting for the email that says, "I'm using 
Swish-e to index a billion docs and my machine started dancing around on the 
table and smoking like a chimney!"

If you're using -S prog (which spider.pl does, IIRC), then that sounds like a 
perfect candidate for a hook or callback to compare docs before passing on to 
Swish-e to index.
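To make that concrete, here's a minimal sketch of such a hook, again in Python rather than Perl. The Path-Name/Content-Length headers are the -S prog protocol as I recall it; the `near_duplicate` check (using stdlib difflib as a stand-in for whatever similarity measure you prefer) and its cutoff are assumptions.

```python
import sys
import difflib

def near_duplicate(body, seen, cutoff=0.95):
    """True if `body` is at least `cutoff` similar to any already-seen body.
    difflib's ratio is just a stand-in for your preferred distance measure."""
    return any(difflib.SequenceMatcher(None, body, old).ratio() >= cutoff
               for old in seen)

def emit(path, body):
    """Write one document to stdout in the -S prog style: headers, a blank
    line, then the raw content."""
    data = body.encode("utf-8")
    sys.stdout.write("Path-Name: %s\n" % path)
    sys.stdout.write("Content-Length: %d\n\n" % len(data))
    sys.stdout.flush()
    sys.stdout.buffer.write(data)
    sys.stdout.buffer.flush()

def feed(docs):
    """The hook: compare each doc against what we've already emitted, and
    only pass novel docs on to the indexer."""
    seen = []
    for path, body in docs:
        if near_duplicate(body, seen):
            continue          # duplicate never reaches Swish-e at all
        seen.append(body)
        emit(path, body)
```

The point is that the dedup logic lives entirely in the feeding program; Swish-e just indexes whatever survives the filter.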

IMHO, Swish-e should handle whatever you hand to it, quickly, at least up to an 
(as yet undefined?) scale. What you hand to it, using whatever algorithms you 
might devise, can (and should?) vary with the application.



-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Tue Feb 8 06:05:06 2005