I posted some changes to rank.c to add a simple version of IDF
(Inverse Document Frequency) weighting for docs. The code is far from
optimal. I think of it more as a proof of concept that I wanted others
to test out.
This change is now in the current CVS snapshot. If you would like to
test it out, we'd appreciate feedback on whether this is a good
direction to move with the ranking code. See if it makes much of a
difference in the rankings of your searches, and if it makes a "good"
difference in the rankings.
Get the CVS snapshot at:
A little ranking primer (Bill, correct me if I don't get this exactly
right):
For every doc that matches a query, each word in the query that appears
in that doc is given a score. Then all the scores are summed, and the
docs are normalized to a rank of 1000 on down.
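
To make the sum-and-normalize step concrete, here is a minimal sketch in C. It is not the actual rank.c code: the struct and function names are invented for this example, and I am assuming that "normalized to 1000" means the best-scoring doc is scaled to 1000 and the rest in proportion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result record for this sketch; not a swish-e struct. */
struct doc_result {
    double raw_score;  /* sum of the per-word scores for this doc */
    int    rank;       /* normalized rank; the best doc gets 1000 */
};

/* Add up the scores of the query words that appear in one doc. */
double sum_word_scores(const double *word_scores, size_t n_words)
{
    double total = 0.0;
    size_t i;

    assert(word_scores != NULL || n_words == 0);
    for (i = 0; i < n_words; i++)
        total += word_scores[i];
    return total;
}

/* Scale every doc so the highest raw score maps to 1000 (assumed rule). */
void normalize_ranks(struct doc_result *docs, size_t n_docs)
{
    double max = 0.0;
    size_t i;

    assert(docs != NULL || n_docs == 0);
    for (i = 0; i < n_docs; i++)
        if (docs[i].raw_score > max)
            max = docs[i].raw_score;

    for (i = 0; i < n_docs; i++) {
        if (max > 0.0)
            docs[i].rank = (int)(docs[i].raw_score * 1000.0 / max + 0.5);
        else
            docs[i].rank = 0;  /* nothing scored, so nothing to scale */
    }
}
```

With raw scores of 4, 2 and 1, this gives ranks of 1000, 500 and 250.
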
In the 2.4.2 release, the word score is calculated based on the
frequency of the word and the position of the word (in a metaname, in a
title, etc.). The word score is based only on the frequency/structure in
that doc.
In the current CVS snap, score is based in part on the frequency of a
word across the whole index. The word score starts the same way, but
then an IDF weight is multiplied against the score. The effect should be
that words that appear in fewer documents in the index have a higher IDF
(greater weight) than words that appear in lots of documents, and thus
cause those docs in which they appear to rank higher.
For example, if the word 'the' appears in 98% of the docs in your index,
it will have an IDF of 1. If the word 'foo' appears in 10% of your docs,
it will have an IDF of something greater than 1 (something like 5 or 6,
depending on the math, number of docs, etc.). So for a query of 'the
foo', docs with more instances of 'foo' will rank relatively higher than
docs with fewer instances of 'foo', while instances of 'the' will affect
ranking much the same way they do now (that is to say, not much).
IDF has a similar effect to IgnoreLimit or StopWords, but on a smoother
scale. A word isn't just in or out (a StopWord or not), but rather has a
relative weight compared to all the other words in the index.
I have several other new ranking features in the works, but wanted to
get some feedback for this one before I move ahead too much in this
direction. Other features might include:
  - normalizing weight for word density/document length
  - scaling the IDF to allow for greater granularity in difference
  - weighting words based on their proximity to other query words
Please send your feedback to the list.
Peter Karman - Software Publications Engineer - Cray Inc
phone: 651-605-9009 - mailto:firstname.lastname@example.org
Received on Wed Aug 4 04:11:47 2004