From: Mark Gaulin <gaulin(at)not-real.designinfo.com>

Date: Wed Aug 12 1998 - 15:56:46 GMT

Date: Wed Aug 12 1998 - 15:56:46 GMT

Hi Ranking has got to be a bit of a black art, so I am not about to criticize the exact ranking weights, etc, but there does seem to be a basic flaw in the way swish-e 1.1 combines the ranks when doing searches of three or more search words. Below is the "proof", but I'll start with the conclusion: When searching for three words (using AND), the final rank of a file will be rank(word1) * 25% + rank(word2) * 25% + rank(word3) * 50% and this is counter intuitive. The desired ranking (in my opinion) is to weight each of the word ranks at 33%, and that is what I plan on implementing. Any other suggestions? Proof: Every word in a file has a rank (call it "rank(file, word)"). When searching for two words the steps are to find all of the files that contain word1 (store in list1), then find the files that contain word2 (store in list2), and then do the AND. To do the AND you find all of the files that are in both list1 and list2 and store them in list1&2. Simple. The trick is computing a new "rank" value when you take a given file from list1 (where the file is ranked using "word1") and list2 (where the same file is ranked using "word2") and place it in list1&2. The current algorithm averages to two ranks. So, the rank of "file1" would look like this: rank(file1, "word1 AND word2") = rank(file1, "word1") * 50% + rank(file1, "word2") * 50% That feels ok, intuitively... we weight each of the words "the same". Trouble: When there are three words (word1, word2, word3) the algorithm will do all of the above steps with list1 and list2, and it will create list1&2 exactly the same way, yielding the exact same rank for file1. Then we produce list3, which has all files that contain word3 and use the same ANDing technique above to create list(1&2)&3. And then we rank using the averaging technique: rank(file1, "word1+word2 AND word3") = rank(file1, "word1+word2") * 50% + rank(file1, "word3") * 50% but rank(file1, "word1+word2") = the previously computed rank = rank(file1, "word1") * 50% + rank(file1, "word2") * 50% so the final ranking for file1 is rank(file1, "word1+word2 AND work3") = (rank(file1, "word1") * 50% + rank(file1, "word2") * 50% )* 50% + rank(file1, "word3") * 50% and so we are weighing the rank of word3 at 50% while word1 and word2 are only weighed at 25%.... the last search term is always weighed at 50% and the other terms are always spread out over the remaining 50%. MarkReceived on Wed Aug 12 09:06:18 1998