Skip to main content.
home | support | download

Back to List Archive

Re: [SWISH-E:440] Ranking algorithm is flawed when more than two search terms...

From: Mark Gaulin <gaulin(at)not-real.designinfo.com>
Date: Thu Aug 13 1998 - 19:01:18 GMT
I actually removed the file size part of the ranking algorithm in my copy
(conditionally compiled it out, to be exact) yesterday and it helped my 
situation out tremendously. I already have a fix in place for the ANDing
problem (it only took a minor tweak to do the way I wanted), and with
the Stemming routine in place I get results that are better and more
predictable than Microsoft's Index Server.

	Mark Gaulin

At 11:14 AM 8/13/98 -0700, Roy Tennant wrote:
>Mark,
>Wow, a fairly thorough evaluation, but to me it is like being concerned
>about the size of your stateroom while the ship is sinking. SWISH (and by
>extension SWISH-E) ranking is a piece of garbage (in my opinion). It dates
>back to WAIS days, when all they did was count up the word occurrence and
>factor in file size. This gives you small files ranking high. Whoopee. I
>long ago gave up on relevance ranking entirely and try to either place the
>hits in context (such as the main server section in which they are found)
>or alphabetize them. Although I applaud your work in finding this ranking
>abnormality, I am not enthusiastic about spending much time "fixing" it,
>since to me that is but the tip of the iceberg.
>Roy
>
>On Wed, 12 Aug 1998, Mark Gaulin wrote:
>
>> Hi
>> Ranking has got to be a bit of a black art, so I am not about to
criticize the
>> exact ranking weights, etc, but there does seem to be a basic flaw in
the way 
>> swish-e 1.1 combines the ranks when doing searches of three or more search
>> words.
>> 
>> Below is the "proof", but I'll start with the conclusion:
>> When searching for three words (using AND), the final rank of a file
will be 
>> rank(word1) * 25% + rank(word2) * 25% + rank(word3) * 50%
>> and this is counter intuitive.
>> The desired ranking (in my opinion) is to weight each of the word ranks at
>> 33%,
>> and that is what I plan on implementing.  Any other suggestions?
>> 
>> Proof: Every word in a file has a rank (call it "rank(file, word)"). 
>> When searching for two words the steps are to find all of the files that
>> contain 
>> word1 (store in list1), then find the files that contain word2 (store in
>> list2), and 
>> then do the AND. To do the AND you find all of the files that are in both
>> list1 
>> and list2 and store them in list1&2. Simple.
>> The trick is computing a new "rank" value when you take a given file from
>> list1
>> (where the file is ranked using "word1") and list2 (where the same file is
>> ranked
>> using "word2") and place it in list1&2. The current algorithm averages to
>> two ranks.
>> So, the rank of "file1" would look like this:
>> rank(file1, "word1 AND word2") = 
>> 	rank(file1, "word1") * 50% + rank(file1, "word2") * 50%
>> That feels ok, intuitively... we weight each of the words "the same".
>> 
>> Trouble: When there are three words (word1, word2, word3) the algorithm
will
>> do all of the above steps with list1 and list2, and it will create list1&2
>> exactly
>> the same way, yielding the exact same rank for file1. 
>> Then we produce list3, which has all files that contain word3 and use
the same
>> ANDing technique above to create list(1&2)&3. And then we rank using
>> the averaging technique:
>> rank(file1, "word1+word2 AND word3") =
>> 	rank(file1, "word1+word2") * 50% + rank(file1, "word3") * 50%
>> but rank(file1, "word1+word2") = 
>> 	the previously computed rank  =
>> 	rank(file1, "word1") * 50% + rank(file1, "word2") * 50%
>> so the final ranking for file1 is
>> rank(file1, "word1+word2 AND work3") =
>> 	(rank(file1, "word1") * 50% + rank(file1, "word2") * 50% )* 50% + 
>> 	rank(file1, "word3") * 50%
>> and so we are weighing the rank of word3 at 50% while word1 and word2 are
>> only weighed at 25%.... the last search term is always weighed at 50% and
>> the other terms are always spread out over the remaining 50%.
>> 
>> 	Mark
>> 
>> 
> 
Received on Thu Aug 13 12:12:30 1998