The individual ranks of single words do have a "frequency" component,
including the frequency of the word within *all* of the files that are being
indexed (total frequency). In fact, the computed rank is very heavily
dependent on the total frequency: the rank is first computed using
the word's frequency within the file and the number of words in the file,
and then it is divided by the total frequency. That seems harsh.
(Side effect: since the total frequency depends on which files are in the
index, merging indexes will give different rankings than if the files
were indexed all at once.)
The rank does not use the word's relative position in the file, however.
I think it would be cool if it did, but I'm not sure how to approach it.
Anyway, like I said before, coming up with a good ranking mechanism
has got to be more of a black art, and as long as it is consistently applied
I can live with it for now.
FYI, here is the existing ranking function:
rank = ((log(max(freqInFile, 5)) + 10) / freqInAllFiles) / numWordsInFile *
(There is a scale factor applied if the word is "emphasized" in the file)
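
For illustration, here is a minimal Python sketch of the formula as quoted
above. It is not the indexer's actual code. The function name, the snake_case
parameter names and the sample numbers are made up for the example; the
natural log is an assumption; and `scale` is a hypothetical stand-in for
whatever follows the trailing `*`, which is cut off in the formula as quoted.

```python
import math


def word_rank(freq_in_file, freq_in_all_files, num_words_in_file, scale=1.0):
    """Rank of one word in one file, following the formula quoted above.

    `scale` is a hypothetical placeholder for the factor after the trailing
    `*` in the quoted formula; its real value is not given in the message.
    """
    # Frequencies below 5 are treated as 5, so rare words share one floor.
    base = math.log(max(freq_in_file, 5)) + 10  # natural log is an assumption
    # Dividing by the total frequency penalizes words common across the index.
    return base / freq_in_all_files / num_words_in_file * scale


# A word occurs 8 times in a 200-word file (made-up numbers).
# Indexed on its own, the total frequency is 8; indexed together with
# other files that also contain the word, the total frequency is 20.
rank_in_separate_index = word_rank(8, 8, 200)
rank_in_combined_index = word_rank(8, 20, 200)

print(rank_in_separate_index, rank_in_combined_index)
```

With these numbers the same file ranks 2.5 times higher (20 / 8) in the
separate index than in the combined one. Assuming ranks are stored at index
time and not recomputed on merge, that would account for the side effect
mentioned above.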
>You're right, this doesn't make much sense. But if we're going to go to
>the trouble, the ranking algorithm(s) should go a bit deeper than even
>For example, if a search term occurs more frequently, or earlier, than in
>other documents, the document should be ranked higher.
>There are lots of other considerations.
Received on Wed Aug 12 11:38:24 1998