Re: Re: Inaccurate ranking?

From: <sam1600(at)not-real.iname.com>
Date: Fri Feb 26 1999 - 11:44:35 GMT
Mark,

If you find the time to take another look at the ranking algorithms, please let me know.  I'm going to put my attempt at improving them on hold for a while, mainly because of my lack of experience with C.  As I see it, my new ranking function should have yielded better results. It did when I tested it using PHP ( www.php.net ):

<?php
/*change the values to test results*/
$totlefiles = 1000;         
$numfileswithword = 89;    
$freq = 2;                  
$tfreq = 300;                

$inversefreq = log($totlefiles/$numfileswithword);
$tmprank = (($tfreq)*(0.5 + (0.5*($freq/50))))*$inversefreq;

/* $tmprank = (int) tmprank; */
settype($tmprank, "integer");

if ($tmprank <= 0):
	$tmprank = 1;
endif;

if (!($tmprank % 128)):
	$tmprank++;
endif;

 /* return tmprank; */
echo "tmprank is: $tmprank";

?>

PHP is great.  It's so close to C and no need to compile.

I guess my new ranking function did not produce the results I was hoping for because the rank function is only a small part of the equation. How search.c handles the rank value is probably just as important, and I haven't taken a close look at that yet.

Thanks a lot and I hope to hear from you again.

-Sam




 ---- you wrote: 
> Hi Sam
> I wish I had the time to dive into this, but I don't right now.
> Understanding the existing ranking algorithm took a while
> and these new routines are even more complex.  One
> thing that did help me out when I was in there was to make
> a spreadsheet that implemented the ranking algorithms,
> make sure it works, and then compare it to swish as it is
> running. I added a new trace level (4) that dumps the bits
> I needed during the index process so I could plug those numbers 
> into my spreadsheet model and see if it was working.
> Sorry I can't be of more help right now.
> 	Mark
> 
> At 07:35 AM 2/25/99 -0800, sam1600@iname.com wrote:
> >Mark,
> >
> >Thanks a lot for getting back to me!
> >
> >Sorry for taking so long to respond but I've
> >been obsessed with the swish ranking function ;-)
> >-
> >
> >I had "IgnoreTotalWordCountWhenRanking = yes"
> >
> >..when I sent my last email.  Your additional
> >code does make a Huge improvement over not using
> >it!
> >
> >But as you know my results were still unfavorable.
> >
> >So I decided to dive into the code myself and
> >see if I could improve the rank function.
> >
> >I searched the web for a rank function and found
> >a couple of resources.
> >
> >It appears that Dr. Dik L. Lee:
> >http://www.cs.ust.hk/faculty/dlee/bio.html
> >
> >.. had a large part in "Document Ranking and the
> >Vector-Space Model" ( see his page above for
> >downloadable publications on the topic ).
> >
> >There is also another page authored in part by
> >Dr. Lee on the topic of ranking which I found
> >at W3C:
> >http://www.w3.org/Conferences/WWW4/Papers/66/
> >
> >I got lucky when I found that page because the
> >image of the scientific notation for the
> >"ranking algorithm" has the algebraic equation
> >as its <Alt> text. ( My math skills are no longer
> >up to par ;-)
> >
> >Here is Dr. Lee's equation and some explanation of the
> >variables ( as taken from the www.w3.org page ):
> >
> >-----------
> >R(i,Q) = Sum (for all term(j) in Q) (0.5 + 0.5 * TF(i,j)/TF(i,max)) * IDF(j)
> >
> >where
> >TF(i,j) is the term frequency of term(j) in document(i), and
> >
> >TF(i,max) is:
> >the maximum term frequency of a keyword in document(i) and
> >
> >IDF(j) is the inverse document frequency of term(j),
> >which is defined as in Equation 2 below:
> >IDF(j) = log(N/DF(j))
> >
> >where N is the number of documents in the collection, and
> >DF(j) is the number of documents containing term(j)
> >
> >-------------
> >
> >I think the above equation is a bit different from the one
> >already used in Swish. The current getrank
> >function does not include the "total number of files"
> >or the "total number of files containing the query word".
> >
> >So here is what I added/changed to the Swish index.c and
> >index.h files...:
> >
> ><BEGIN NEW GETRANK FUNCTION>:
> >int getrank(totlfiles, nmfileswithword, freq, tfreq, words, emphasized)
> >
> >/* totlfiles=total number of files */
> >int totlfiles;
> >/* nmfileswithword=sum of only files containing query word*/
> >int nmfileswithword;
> >/* freq=sum of queryword in this ONE file containing queryword */
> >int freq;
> >/* tfreq=sum of queryword in ALL files containing queryword */
> >int tfreq;
> >
> >int words;
> >int emphasized;
> >{
> >
> >double inversefreq, f;
> > int tmprank;
> >
> >/*
> >**my rendering of the function found on
> >**http://www.w3.org/Conferences/WWW4/Papers/66/
> >*/
> >inversefreq = log(totlfiles/nmfileswithword);
> >f = ((tfreq) * (0.5 + (0.5 * (freq/50)))) * inversefreq;
> >
> >tmprank = (int) f;
> > if (tmprank <= 0)
> >  tmprank = 1;
> > if (emphasized)
> >  tmprank *= emphasized;
> > if (!(tmprank % 128))
> >  tmprank++;
> >
> > return tmprank;
> >}
> ><END NEW GETRANK FUNCTION>
> >
> >So I also added/changed the following to the printindex function:
> >
> ><BEGIN PRINTINDEX FUNCTION changes/additions>:
> >
> > int numfileswithword, myfilep, thetotalfiles;
> >
> >/*added this loop to count only the number of files containing the queryword */
> >
> >numfileswithword = 0;
> >while (lp != NULL) {
> > numfileswithword++;
> > lp = lp->next;
> > }
> >
> >/* Here I set lp back to what it was before my loop */
> >
> >lp = ep->locationlist;
> >
> >/* Here I try to get the total number of files/pages
> >** in the whole document structure...
> >** I really don't know if my variable "thetotalfiles"
> >** is being set because I don't know if the variable
> >** "filelist" is set here inside the printindex function.
> >** I don't do anything with filelist other than include
> >** it here:
> >*/
> >
> >   myfilep = filelist;
> >   thetotalfiles = getfilecount(myfilep);
> >
> >/*Here I added the new parameters to the getrank function call */
> >
> >rank = getrank(thetotalfiles, numfileswithword, lp->frequency,
> >ep->tfrequency, totalWords, lp->emphasized);
> >
> ><END PRINTINDEX FUNCTION changes/additions>
> >
> >And finally I added a couple of variable declarations to the
> >index.h file getrank declaration:
> >
> >int getrank _AP ((int, int, int, int, int, int));
> >
> >So, as you can see, my C skills are lacking :-0
> >
> >I might have made some stupid errors but it did compile.
> >The ranking has NOT improved!  Maybe the index file
> >is totally screwed up ( it does look a lot different than
> >the old one)
> >
> >As you can see I have left out a bit of stuff from the rank
> >function... including "IgnoreTotalWordCountWhenRanking".
> >I thought I'd test it raw.
> >
> >Mark, can you ( or anyone else reading this ) see my
> >mistakes?  Or suggest improvements?
> >
> >Also Dr. Lee has a WEALTH of information on
> >ranking algorithms on his page available for download
> >(in PostScript format).
> >Especially the following:
> >
> >1) "Document Ranking and the Vector-Space Model"
> >2) "Search and Ranking Algorithms for Locating Resources
> >on the World Wide Web"
> >3) "Implementation of Partial Document Ranking Using
> >Inverted Files"
> >
> >Dr Lee's ranking equations in those files mentioned
> >above are much more complicated than the one I used.
> >( well I can't figure them out anyway ;-)
> >
> >If there are any math whizzes out there, maybe you could take
> >a look and convert the equations into "simple" equations for
> >me.
> >
> >Dr. Lee's files are in PostScript, so you need a viewer to
> >view them.  Download Ghostscript at:
> >http://www.cs.wisc.edu/~ghost/cd.html
> >
> >I look forward to hearing from you all.
> >
> >Thanks,
> >
> >Sam
> >
> >
> > ---- you wrote: 
> >> Hi
> >> The ranking is "complex". It uses the total number of words in
> >> a file to spread out the "weight" of any given word more "evenly."
> >> This behavior did not work for me so I added a new directive
> >> called "IgnoreTotalWordCountWhenRanking". You should see
> >> this (commented out) in your config file. Uncomment it and
> >> set it to "yes", then reindex.  This will cause the rank to be more 
> >> in line with word count. Try this and see if it helps.
> >> 	Mark
> >> 
> >> 
> >> At 03:39 PM 2/20/99 -0800, sam1600@iname.com wrote:
> >> >Hello,
> >> >
> >> >Sorry for posting what may be a blatantly newbie
> >> >comment/question ;-) but for some reason a search
> >> >on a particular keyword always returns an inaccurate
> >> >ranking.  This keyword "gmc" occurs at least twice
> >> >on every page ( once in a metatag and once in a link )
> >> >but more than ten times on a particular page.
> >> >( I have a search box on every page, and why
> >> >people search for this word when there is a link is
> >> >beyond me but they do anyway ;-)
> >> >
> >> >It is a small site with only a few dozen pages
> >> >and if I search for this keyword that I know
> >> >for sure occurs on a certain page more
> >> >times than other pages, the said page is
> >> >ranked far down the list.
> >> >
> >> >The command line is simple with just
> >> >the -f -w and -m options specified.
> >> >
> >> >I've read in the mailing list that the rank
> >> >algorithm takes a few things into account
> >> >when ranking but I just can't see why it
> >> >would override the total number of
> >> >occurrences of a keyword as the most important
> >> >criteria.
> >> >
> >> >I've been using Swish ( not Swish-e ) and have
> >> >been logging the visitors search keywords
> >> >( and this keyword is a popular one... hence
> >> >the reason for me testing it ).  I'm not obsessed
> >> >with this keyword ;-) I'm just curious how often the
> >> >same inaccurate ranking is occurring on other words also.
> >> >
> >> >Oh, by the way.  This bad ranking is NOT occurring
> >> >with the old Swish.
> >> >
> >> >Comments anyone?
> >> >
> >> >Thanks,
> >> >
> >> >Sam
> >> >
> >> >
> >> >
> >> >----------------------------------------------------------------
> >> >Get your free email from AltaVista at http://altavista.iname.com
> >> > 
> >> 
> >
> >
> > 
> 


Received on Fri Feb 26 03:44:39 1999