ranking change

From: Peter Karman <karman(at)not-real.cray.com>
Date: Wed Aug 04 2004 - 11:11:31 GMT
I posted some changes to rank.c to implement a simple version of IDF 
(Inverse Document Frequency) weighting to docs. The code is far from 
optimal. I think of it more as a proof of concept that I wanted others 
to test out.

This change is now in the current CVS snapshot. If you would like to 
test it out, we'd appreciate feedback on whether this is a good 
direction to move with the ranking code. See if it makes much of a 
difference in the rankings of your searches, and if it makes a "good" 
difference in the rankings.

Get the CVS snapshot at:
http://swish-e.org/dev/swish-daily/

A little ranking primer (Bill, correct me if I don't get this exactly 
right).

For every doc that matches a query, each word in the query that appears 
in that doc is given a score. Then all the scores are summed, and the 
docs' totals are normalized so that ranks run from 1000 on down.
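
As a rough illustration of the sum-then-normalize step (a sketch only, not 
the code in rank.c): the function name, the sample scores, and the linear 
scaling against the best doc are all assumptions made for the example.

```python
def rank_documents(word_scores_by_doc, top_rank=1000):
    """Sum per-word scores for each doc, then scale so the best doc gets top_rank.

    Linear scaling is an assumption for illustration; the real code may
    normalize differently.
    """
    totals = {doc: sum(scores.values()) for doc, scores in word_scores_by_doc.items()}
    best = max(totals.values(), default=0)
    if best <= 0:
        return {doc: 0 for doc in totals}
    return {doc: round(top_rank * total / best) for doc, total in totals.items()}


# Hypothetical per-word scores for two docs matching the query 'the foo'.
scores = {
    "a.html": {"the": 2.0, "foo": 6.0},
    "b.html": {"the": 3.0, "foo": 1.0},
}
ranks = rank_documents(scores)
print(ranks)
```

The best-scoring doc lands on 1000 and every other doc is placed relative 
to it.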

In the 2.4.2 release, the word score is calculated based on the 
frequency of the word and the position of the word (in a metaname, in a 
title, etc.). The word score is based only on the frequency/structure in 
each doc.

In the current CVS snapshot, the score is also based in part on the 
frequency of a word across the whole index. The word score starts the same 
way, but then it is multiplied by an IDF weight. The effect should be 
that words that appear in fewer documents in the index have a higher IDF 
(greater weight) than words that appear in lots of documents, and thus 
cause those docs in which they appear to rank higher.

For example, if the word 'the' appears in 98% of the docs in your index, 
it will have an IDF of 1. If the word 'foo' appears in 10% of your docs, 
it will have an IDF of something greater than 1 (something like 5 or 6, 
depending on the math, number of docs, etc.). So for a query of 'the 
foo', docs with more instances of 'foo' will rank relatively higher than 
docs with fewer instances of 'foo', while instances of 'the' will affect 
ranking much the same way they do now (that is to say, not much).
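
To make the 'the foo' example concrete, here is a small sketch. It uses a 
textbook IDF of 1 + ln(N/df), which is an assumption and not necessarily 
the formula in rank.c, so the weights will not match the numbers above 
exactly. The base score here is just the raw word count (the real score 
also looks at structure), and the index size, document frequencies, and 
the two docs are made up.

```python
import math

TOTAL_DOCS = 1000                       # hypothetical index size
DOC_FREQ = {"the": 980, "foo": 100}     # hypothetical: 98% and 10% of docs


def idf(total_docs, docs_with_word):
    """Textbook-style IDF (an assumption, not the rank.c formula).

    Close to 1 for words found in nearly every doc, larger for rarer words.
    """
    return 1.0 + math.log(total_docs / docs_with_word)


def plain_score(word_counts):
    """Stand-in for the old behaviour: frequency within the doc only."""
    return sum(word_counts.values())


def weighted_score(word_counts):
    """Same base score, with each word's contribution multiplied by its IDF."""
    return sum(count * idf(TOTAL_DOCS, DOC_FREQ[word])
               for word, count in word_counts.items())


doc_a = {"the": 10, "foo": 1}   # lots of 'the', little 'foo'
doc_b = {"the": 2, "foo": 5}    # little 'the', lots of 'foo'

for name, doc in (("doc_a", doc_a), ("doc_b", doc_b)):
    print(name, plain_score(doc), round(weighted_score(doc), 2))
```

Without the IDF weight doc_a scores higher on sheer word count; with it, 
doc_b moves ahead because its matches are mostly the rarer word.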

IDF has a similar effect to IgnoreLimit or StopWords, but on a smoother 
scale. A word isn't just in or out (a StopWord or not), but rather has a 
relative weight compared to all the other words in the index.
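
The in-or-out versus smooth-scale difference can be sketched like this. 
Both functions are simplified stand-ins: the cutoff models only a 
percentage threshold in the spirit of IgnoreLimit, and the IDF is the 
textbook 1 + ln(1/fraction), which is an assumption rather than the 
rank.c formula.

```python
import math


def cutoff_weight(doc_fraction, ignore_above=0.80):
    """In or out: a word in more than ignore_above of the docs counts for nothing."""
    return 0.0 if doc_fraction > ignore_above else 1.0


def idf_weight(doc_fraction):
    """Graded: the weight grows steadily as the word gets rarer (assumed formula)."""
    return 1.0 + math.log(1.0 / doc_fraction)


# Fraction of docs in the index that contain the word (made-up values).
fractions = [0.98, 0.50, 0.10, 0.01]
table = [(f, cutoff_weight(f), idf_weight(f)) for f in fractions]

for fraction, cutoff, weight in table:
    print(f"{fraction:5.2f}  cutoff={cutoff:.0f}  idf={weight:.2f}")
```

The cutoff column only ever shows 0 or 1, while the IDF column climbs 
gradually as the word appears in fewer docs.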

I have several other new ranking features in the works, but wanted to 
get some feedback for this one before I move ahead too much in this 
direction. Other features might include:

	normalizing weight for word density/document length
	scaling the IDF to allow for greater granularity in difference
	weighting words based on their proximity to other query words

Please send your feedback to the list.

thanks!
-- 
Peter Karman - Software Publications Engineer - Cray Inc
phone: 651-605-9009 - mailto:karman@cray.com
Received on Wed Aug 4 04:11:47 2004