Re: Re: Inaccurate ranking?

From: Mark Gaulin <gaulin(at)not-real.globalspec.com>
Date: Thu Feb 25 1999 - 16:28:01 GMT
Hi Sam
I wish I had the time to dive into this, but I don't right now.
Understanding the existing ranking algorithm took a while
and these new routines are even more complex.  One
thing that did help me out when I was in there was to make
a spreadsheet that implemented the ranking algorithms,
make sure it works, and then compare it to swish as it is
running. I added a new trace level (4) that dumps the bits
I needed during the index process so I could plug those numbers 
into my spreadsheet model and see if it was working.
Sorry I can't be of more help right now.
	Mark

At 07:35 AM 2/25/99 -0800, sam1600@iname.com wrote:
>Mark,
>
>Thanks a lot for getting back to me!
>
>Sorry for taking so long to respond, but I've
>been obsessed with the swish ranking function ;-)
>
>I had "IgnoreTotalWordCountWhenRanking = yes"
>
>..when I sent my last email.  Your additional
>code does make a Huge improvement over not using
>it!
>
>But as you know, my results were still unfavorable.
>
>So I decided to dive into the code myself and
>see if I could improve the rank function.
>
>I searched the web for a rank function and found
>a couple of resources.
>
>It appears that Dr. Dik L. Lee:
>http://www.cs.ust.hk/faculty/dlee/bio.html
>
>.. had a large part in "Document Ranking and the
>Vector-Space Model" (see his page above for
>downloadable publications on the topic).
>
>There is also another page authored in part by
>Dr. Lee on the topic of ranking which I found
>at W3C:
>http://www.w3.org/Conferences/WWW4/Papers/66/
>
>I got lucky when I found that page because the
>image of the scientific notation for the
>"ranking algorithm" has the algebraic equation
>as its <Alt> text. ( My math skills are no longer
>up to par ;-)
>
>Here is Dr. Lee's equation and some explanation of the
>variables ( as taken from the www.w3.org page ):
>
>-----------
>R(i,Q) = Sum (for all term(j) in Q) (0.5 + 0.5 TF(i,j)/TF(i,max)) IDF(j)
>
>where
>TF(i,j) is the term frequency of term(j) in document(i), and
>
>TF(i,max) is:
>the maximum term frequency of a keyword in document(i) and
>
>IDF(j) is the inverse document frequency of term(j),
>which is defined as in Equation 2 below:
>IDF(j) = log(N/DF(j))
>
>where N is the number of documents in the collection, and
>DF(j) is the number of documents containing term(j)
>
>-------------
>
>I think the above equation is a bit different from the one
>already used in Swish. The current getrank
>function does not include the "total number of files"
>or the "total number of files containing the query word".
>
>So here is what I added/changed to the Swish index.c and
>index.h files...:
>
><BEGIN NEW GETRANK FUNCTION>:
>int getrank(totlfiles, nmfileswithword, freq, tfreq, words, emphasized)
>
>/* totlfiles=total number of files */
>int totlfiles;
>/* nmfileswithword=sum of only files containing query word*/
>int nmfileswithword;
>/* freq=sum of queryword in this ONE file containing queryword */
>int freq;
>/* tfreq=sum of queryword in ALL files containing queryword */
>int tfreq;
>
>int words;
>int emphasized;
>{
>
>double inversefreq, f;
> int tmprank;
>
>/*
>**my rendering of the function found on
>**http://www.w3.org/Conferences/WWW4/Papers/66/
>*/
>inversefreq = log(totlfiles/nmfileswithword);
>f = ((tfreq) * (0.5 + (0.5 * (freq/50)))) * inversefreq;
>
>tmprank = (int) f;
> if (tmprank <= 0)
>  tmprank = 1;
> if (emphasized)
>  tmprank *= emphasized;
> if (!(tmprank % 128))
>  tmprank++;
>
> return tmprank;
>}
><END NEW GETRANK FUNCTION>
>
>So I also added/changed the following to the printindex function:
>
><BEGIN PRINTINDEX FUNCTION changes/additions>:
>
> int numfileswithword, myfilep, thetotalfiles;
>
>/* added this loop to count only the number of files containing the queryword */
>
>numfileswithword = 0;
>while (lp != NULL) {
> numfileswithword++;
> lp = lp->next;
> }
>
>/* Here I set lp back to what it was before my loop */
>
>lp = ep->locationlist;
>
>/* Here I try to get the total number of files/pages
>** in the whole document structure...
>** I really don't know if my variable "thetotalfiles"
>** is being set because I don't know if the variable
>** "filelist" is set here inside the printindex function.
>** I don't do anything with filelist other than include
>** it here:
>*/
>
>   myfilep = filelist;
>   thetotalfiles = getfilecount(myfilep);
>
>/*Here I added the new parameters to the getrank function call */
>
>rank = getrank(thetotalfiles, numfileswithword, lp->frequency,
>ep->tfrequency, totalWords, lp->emphasized);
>
><END PRINTINDEX FUNCTION changes/additions>
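The counting loop above walks lp itself and then restores it from ep->locationlist. An alternative, sketched here, is to count with a separate cursor so the caller's pointer never moves. The struct is a simplified, hypothetical stand-in for swish's location list, and count_locations is an illustrative name rather than an existing swish function.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a location list node. */
struct location {
    int frequency;
    struct location *next;
};

/* Count the nodes using a private cursor; the head pointer passed
 * in by the caller is left untouched. */
int count_locations(const struct location *head)
{
    int n = 0;
    for (const struct location *p = head; p != NULL; p = p->next)
        n++;
    return n;
}
```

Because the list has one node per file containing the word, the count returned here would play the role of numfileswithword in the snippet above.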
>
>And finally I added a couple of parameter types to the
>getrank declaration in the index.h file:
>
>int getrank _AP ((int, int, int, int, int, int));
>
>So, as you can see, my C skills are lacking :-0
>
>I might have made some stupid errors but it did compile.
>The ranking has NOT improved!  Maybe the index file
>is totally screwed up ( it does look a lot different than
>the old one)
>
>As you can see I have left out a bit of stuff from the rank
>function... including "IgnoreTotalWordCountWhenRanking".
>I thought I'd test it raw.
>
>Mark, can you ( or anyone else reading this ) see my
>mistakes?  Or suggest improvements?
>
>Also, Dr. Lee has a WEALTH of information on
>ranking algorithms on his page available for download
>(in PostScript format).
>Especially the following:
>
>1) "Document Ranking and the Vector-Space Model"
>2) "Search and Ranking Algorithms for Locating Resources
>on the World Wide Web"
>3) "Implementation of Partial Document Ranking Using
>Inverted Files"
>
>Dr Lee's ranking equations in those files mentioned
>above are much more complicated than the one I used.
>( well I can't figure them out anyway ;-)
>
>If there are any math whizzes out there, maybe you could take
>a look and convert the equations into "simple" equations for
>me.
>
>Dr. Lee's files are in PostScript, so you need a viewer to
>view them.  Download Ghostscript at:
>http://www.cs.wisc.edu/~ghost/cd.html
>
>I look forward to hearing from you all.
>
>Thanks,
>
>Sam
>
>
> ---- you wrote: 
>> Hi
>> The ranking is "complex". It uses the total number of words in
>> a file to spread out the "weight" of any given word more "evenly."
>> This behavior did not work for me so I added a new directive
>> called "IgnoreTotalWordCountWhenRanking". You should see
>> this (commented out) in your config file. Uncomment it and
>> set it to "yes", then reindex.  This will cause the rank to be more 
>> in line with word count. Try this and see if it helps.
>> 	Mark
>> 
>> 
>> At 03:39 PM 2/20/99 -0800, sam1600@iname.com wrote:
>> >Hello,
>> >
>> >Sorry for posting what may be a blatantly newbie
>> >comment/question ;-) but for some reason a search
>> >on a particular keyword always returns an inaccurate
>> >ranking.  This keyword "gmc" occurs at least twice
>> >on every page ( once in a metatag and once in a link )
>> >but more than ten times on a particular page.
>> >( I have a search box on every page, and why
>> >people search for this word when there is a link is
>> >beyond me but they do anyway ;-)
>> >
>> >It is a small site with only a few dozen pages
>> >and if I search for this keyword that I know
>> >for sure occurs on a certain page more
>> >times than other pages, the said page is
>> >ranked far down the list.
>> >
>> >The command line is simple with just
>> >the -f -w and -m options specified.
>> >
>> >I've read in the mailing list that the rank
>> >algorithm takes a few things into account
>> >when ranking, but I just can't see why it
>> >would override the total number of
>> >occurrences of a keyword as the most important
>> >criterion.
>> >
>> >I've been using Swish ( not Swish-e ) and have
>> >been logging the visitors' search keywords
>> >( and this keyword is a popular one... hence
>> >the reason for me testing it ).  I'm not obsessed
>> >with this keyword ;-) I'm just curious how often the
>> >same inaccurate ranking is occurring on other words also.
>> >
>> >Oh, by the way.  This bad ranking is NOT occurring
>> >with the old Swish.
>> >
>> >Comments anyone?
>> >
>> >Thanks,
>> >
>> >Sam
>> >
>> >
>> >
>> >----------------------------------------------------------------
>> >Get your free email from AltaVista at http://altavista.iname.com
>> > 
>> 
>
>
Received on Thu Feb 25 08:20:22 1999