I applied Bill's ignoreTotalWordCountWhenRanking fix described below and
the results are indeed dramatic. We always used to get a few 100% docs and
then lots of 1%'s and 2%'s; now we get a pretty nice distribution.
At 20:06 30-04-01, Bill Meier wrote:
>At 01:37 PM 04/30/01, David Wood wrote:
>>Would you mind posting the actual code changes you made to index.c? I'd
>>like to give it a try as well.
>The piece of code now looks like:
>/* if (freq < 5)                            Note #1
>        freq = 5;
>*/
>    d = 1.0 / (double) tfreq;
>    e = (log((double) freq) + 10.0) * d;
>    if (ignoreTotalWordCountWhenRanking)     Note #2
>    {
>        /* scale the rank down a bit. a larger has the effect of
>           making small differences in word frequency wash out */
>        e /= 100;
>    }
>    else
>    {
>        e /= words;
>    }
>    f = e * 10000.0 * 100.0;                 Note #3
>I didn't add any comments, this is just in my private version at the
>moment... This addresses three issues:
>1) ranking was computed the same whether a word occurred in a file 1 time
>or 5 times...
>2) sense of ignoreTotalWordCountWhenRanking is backwards
>3) The computation of rank, when the word is found in more than 100 or so
>files, ends up as such a small integer that all differences wash out.
>Comments about the above:
>1) I don't know why matches less than 5 were set to 5. I don't know if
>removing that creates any strange problems in the ranking, but in my limited
>testing it improved the distribution of ranking. In my opinion, a document
>with 2 hits should be ranked below a document with 5 hits! The above
>change makes that happen...
>2) This is clearly the right fix, except it will break everyone, since the
>sense is backwards. I found that dividing by the number of words so
>heavily weights the ranking towards smaller documents (even when others
>have title matches, or more word matches), that it gave a very inaccurate rank.
>3) This is only a temporary work around. Note that the large ranks that
>can come out of this routine are OK, because in the end the highest rank
>is normalized to 1000 anyway. This is just to prevent ranks like 1.34231
>and 1.53422 from both turning into 1.
>I personally found a positive effect from each of these changes, and all
>of the changes taken together gave far superior results, especially when
>looking for multiple words and/or words that appear in your files somewhat
>frequently.
>My only theory about "how could this ever have worked" was that people
>must have primarily been doing searches for words that only appeared in a
>few files. In this case, swish would find them all, but their ranks
>relative to each other would be nearly irrelevant...
>Some of us also feel that the ranking for a hit in titles and the header
>is too high. Currently it boosts the overall rank by a factor of 5. But
>for now, I didn't try changing that.
>I'd be curious if others try this and see how their searching and ranking
>improve. Be sure that you DO use "IgnoreTotalWordCountWhenRanking yes" !!!
>(which you probably had before, but now it does what you wanted ;-)
>Try -H 9 -- you can see the raw ranking numbers this way. Try it before
>and after this change!
>P.S. If you try this, and think it helps your searches, let us know!
Received on Mon Apr 30 19:03:48 2001