Re: SWISH ranking vs. AltaVista

From: Bill Meier <bill(at)not-real.insulators.com>
Date: Mon Apr 30 2001 - 18:10:34 GMT
At 01:37 PM 04/30/01, David Wood wrote:
>Would you mind posting the actual code changes you made to index.c?  I'd 
>like to give it a try as well.

The piece of code now looks like:

/*    if (freq < 5)             Note #1
         freq = 5;
*/
     d = 1.0 / (double) tfreq;
     e = (log((double) freq) + 10.0) * d;
     if (ignoreTotalWordCountWhenRanking)        /* Note #2 */
     {
         /* scale the rank down a bit. a larger divisor has the effect of
            making small differences in word frequency wash out */
         e /= 100;
     }
     else
     {
         e /= words;
     }
     f = e * 10000.0 * 100.0;            /* Note #3 */

I didn't add any comments; this is just in my private version at the 
moment. The change addresses three issues:

1) ranking was computed the same whether a word occurred in a file 1 time 
or 5 times...

2) sense of ignoreTotalWordCountWhenRanking is backwards

3) The computation of rank, when the word is found in more than 100 or so 
files, ends up as such a small integer that all differences wash out.

Comments about the above:

1) I don't know why fewer than 5 matches were treated as 5. I don't know 
whether removing that creates any strange problems in the ranking, but in 
my limited testing it improved the distribution of ranks. In my opinion, a 
document with 2 hits should be ranked below a document with 5 hits! The 
above change makes that happen...

2) This is clearly the right fix, except it will break everyone, since the 
sense is backwards. I found that dividing by the number of words so heavily 
weights the ranking towards smaller documents (even when others have title 
matches, or more word matches), that it gave a very inaccurate rank.

3) This is only a temporary workaround. Note that the large ranks that can 
come out of this routine are OK, because in the end the highest rank is 
normalized to 1000 anyway. This is just to prevent ranks like 1.34231 and 
1.53422 from both turning into 1.

I personally found a positive effect from each of these changes, and all of 
the changes taken together gave far superior results, especially when 
looking for multiple words and/or words that appear in your files somewhat 
frequently.

My only theory about "how could this ever have worked" was that people must 
have primarily been doing searches for words that only appeared in a few 
files. In this case, swish would find them all, but their ranks relative to 
each other would be nearly irrelevant...

Some of us also feel that the ranking for a hit in titles and the header is 
too high. Currently it boosts the overall rank by a factor of 5. But for 
now, I didn't try changing that.

I'd be curious if others try this and see how their searching and ranking 
improve. Be sure that you DO use "IgnoreTotalWordCountWhenRanking yes" !!! 
(which you probably had before, but now it does what you wanted ;-)

Try -H 9 -- you can see the raw ranking numbers this way. Try it before and 
after this change!

Bill

P.S. If you try this, and think it helps your searches, let us know!
Received on Mon Apr 30 18:11:46 2001