

Re: SWISH ranking vs. AltaVista

From: David Wood <dwood(at)not-real.inter.nl.net>
Date: Mon Apr 30 2001 - 19:03:10 GMT
Hi all,

I applied Bill's ignoreTotalWordCountWhenRanking fix described below, and 
the results are indeed dramatic.  We always used to get a few 100% docs and 
then lots of 1%s and 2%s; now we get a pretty nice distribution.

cheers,

David


At 20:06 30-04-01, Bill Meier wrote:
>At 01:37 PM 04/30/01, David Wood wrote:
>>Would you mind posting the actual code changes you made to index.c?  I'd 
>>like to give it a try as well.
>
>The piece of code now looks like:
>
>/*    if (freq < 5)             Note #1
>         freq = 5;
>*/
>     d = 1.0 / (double) tfreq;
>     e = (log((double) freq) + 10.0) * d;
>     if (ignoreTotalWordCountWhenRanking)        Note #2
>     {
>         /* scale the rank down a bit. a larger divisor has the
>            effect of making small differences in word frequency
>            wash out */
>         e /= 100;
>     }
>     else
>     {
>         e /= words;
>     }
>     f = e * 10000.0 * 100.0;            Note #3
>
>I didn't add any comments; this is just in my private version at the 
>moment... It addresses three issues:
>
>1) Ranking was computed the same whether a word occurred in a file once 
>or five times...
>
>2) The sense of ignoreTotalWordCountWhenRanking is backwards.
>
>3) When a word is found in more than 100 or so files, the computed rank 
>ends up as such a small integer that all differences wash out.
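>
>For illustration, here is a self-contained sketch (not from index.c -- 
>the file counts and word counts are invented, and the "old" formula is 
>reconstructed from the changes described above) comparing the two 
>computations for a word found in 150 files:
>
>#include <stdio.h>
>#include <math.h>
>
>int main(void)
>{
>    int tfreq = 150;               /* files containing the word */
>    int words = 2000;              /* words in this file (old divisor) */
>    int freqs[] = { 1, 2, 5, 20 }; /* per-file hit counts to compare */
>    int i;
>
>    for (i = 0; i < 4; i++)
>    {
>        int freq = freqs[i];
>
>        /* old: clamp low counts to 5, divide by file size, scale */
>        int clamped = (freq < 5) ? 5 : freq;
>        double old_rank = (log((double) clamped) + 10.0)
>                          / (double) tfreq / (double) words * 10000.0;
>
>        /* new: no clamp, divide by 100, extra *100 scale factor */
>        double new_rank = (log((double) freq) + 10.0)
>                          / (double) tfreq / 100.0 * 10000.0 * 100.0;
>
>        printf("freq %2d: old %.5f -> int %d, new %.2f -> int %d\n",
>               freq, old_rank, (int) old_rank,
>               new_rank, (int) new_rank);
>    }
>    return 0;
>}
>
>With these numbers, every "old" rank truncates to the integer 0, while 
>the "new" ranks run from 666 up to 866 and still reflect the hit counts.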
>
>Comments about the above:
>
>1) I don't know why matches of fewer than 5 were clamped to 5. I don't 
>know if removing the clamp creates any strange problems in the ranking, 
>but in my limited testing it improved the distribution of rankings. In 
>my opinion, a document with 2 hits should be ranked below a document 
>with 5 hits! The above change makes that happen...
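>
>For example, with the clamp both log(2) and log(5) became log(5), about 
>1.61, so a 2-hit document and a 5-hit document ranked identically; 
>without it, the 2-hit document gets log(2), about 0.69, and correctly 
>ranks lower.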
>
>2) This is clearly the right fix, except that it will break everyone's 
>existing configurations, since the sense was backwards. I found that 
>dividing by the number of words weights the ranking so heavily towards 
>smaller documents (even when other documents have title matches or more 
>word matches) that it gave a very inaccurate rank.
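>
>For example (made-up sizes): with the same hit count, a 100-word page 
>ends up ranked about 100 times higher than a 10,000-word page, simply 
>because e is divided by the word count.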
>
>3) This is only a temporary workaround. Note that the large ranks that 
>can come out of this routine are OK, because in the end the highest rank 
>is normalized to 1000 anyway. This is just to prevent ranks like 1.34231 
>and 1.53422 from both turning into 1.
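>
>Concretely: without the extra factor of 100, e values of 0.000134231 
>and 0.000153422 become f = 1.34231 and 1.53422, and both truncate to 
>the integer 1; with it, they become 134 and 153 and stay distinct.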
>
>I personally found a positive effect from each of these changes, and all 
>of the changes taken together gave far superior results, especially when 
>looking for multiple words and/or words that appear somewhat frequently 
>in your files.
>
>My only theory about "how could this ever have worked" is that people 
>must have primarily been doing searches for words that only appeared in 
>a few files. In that case, swish would find them all, but their ranks 
>relative to each other would be nearly irrelevant...
>
>Some of us also feel that the ranking for a hit in titles and the header 
>is too high. Currently it boosts the overall rank by a factor of 5. But 
>for now, I didn't try changing that.
>
>I'd be curious to hear whether others try this and how their searching 
>and ranking improve. Be sure that you DO use 
>"IgnoreTotalWordCountWhenRanking yes" !!! (which you probably had 
>before, but now it does what you wanted ;-)
>
>Try -H 9 -- you can see the raw ranking numbers this way. Try it before 
>and after this change!
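>
>Something like (the index path here is made up):
>
>    swish-e -f /path/to/index.swish-e -w foo -H 9
>
>should print the full result header, including the raw rank for each hit.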
>
>Bill
>
>P.S. If you try this, and think it helps your searches, let us know!
Received on Mon Apr 30 19:03:48 2001