
Re: ranking ideas

From: J Robinson <jrobinson852(at)not-real.yahoo.com>
Date: Wed Apr 21 2004 - 14:22:55 GMT
Hello Peter:

Q: Are you saying that in SWISH-E right now, each word is treated as equally important?

Also, I thought you might find this interesting; I came across it on the web today:

http://www.phpconference.de/2003/slides/database_track/golubchik_mysql_fulltext_search_2003.pdf

Page 16 discusses "relevance ranking"; the author notes that "there are many different theories proposing estimation of the word weight in a document" and shows one.

Thirdly, a metric I've used successfully outside of swish-e is:

        weight(word) = -log( num occurrences of word / (num times most common word appeared * 1.3) );

It ain't pretty, but with my normal data it yields word weights ranging from 0.262364 for the most common word (which appears 3648 times) to 8.464299 for words that appear once.
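In C that works out to something like this (just a sketch; the function and variable names are made up for illustration):

    #include <math.h>

    /* Sketch: weight a word by how rare it is relative to the most
     * common word in the data. 'count' is how often the word appears;
     * 'max_count' is the count of the most common word. The 1.3 factor
     * keeps the most common word's weight above zero. */
    static double word_weight(long count, long max_count)
    {
        return -log((double)count / ((double)max_count * 1.3));
    }

    /* word_weight(3648, 3648) ~= 0.262364; word_weight(1, 3648) ~= 8.464299 */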

Best,
 jrobinson

At 5:16 AM -0700 4/19/04, Peter Karman wrote:
Peter Karman wrote on 4/16/04 4:38 PM:

> 2. RelativeFrequencyBias *percent* *bias* *max*
> 
> I know, this seems like newer new math (and my math was never
> outstanding; I'm a literary critic by training...). But consider this
> example and please tell me where my logic is wrong:

I must have been high when I wrote this last Friday.
My example was just 
totally wrong.

What I want to do in ranking is account for the
relative frequency of 
each query word in the total found set. Then apply
something like audio 
compression (not like mp3, but more like analog
compression in 
recording), where all the softer sounds are brought up
to a threshold 
and all the louder sounds are tapered off at a max,
thereby reducing the 
sonic range to within a min and max.

Example:

A search for 'the foo' turns up 100 hits. 'the' appears a total of 1000 times in those 100 hits; 'foo' appears a total of 150 times. My assumption is that 'foo' is a more important word than 'the', based on those numbers.

If we use this formula:

f_bias = max_freq / freq

then the f_bias for 'foo' would be:

f_bias = 1000 / 150 = 6.67
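(By the same formula, the f_bias for 'the' would be 1000 / 1000 = 1.0, so the most common query word in the found set gets no boost at all.)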

In rank.c right now, each word's raw rank per doc is
calculated based on 
structure (context where it appears) and any
MetaNamesRank value.

rank += sw->structure_map[ GET_STRUCTURE(posdata[i]) ] + meta_bias;

What I'm proposing is this:

rank += ( sw->structure_map[ GET_STRUCTURE(posdata[i]) ] + meta_bias ) * f_bias;
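Very roughly, as a sketch only (how max_freq and freq get computed per query word and passed into the ranking loop is hand-waved here):

    /* sketch: inside the existing per-position loop in rank.c.
     * freq is this query word's total occurrences in the found set;
     * max_freq is the total for the most frequent query word. */
    double f_bias = (double)max_freq / (double)freq;

    rank += (int)( ( sw->structure_map[ GET_STRUCTURE(posdata[i]) ]
                     + meta_bias ) * f_bias );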

In my example, if a doc had 'foo' 10 times in a structure worth 9 points, for a normal rank of 90, its rank would jump by 600+ points (10 * 9 * 6.67). This makes sense to me, because in our example this particular doc contains 10 of the 150 total occurrences of 'foo' (about 7%), making it a pretty 'relevant' doc.

This lets docs with less common words rise faster in
the rankings than 
docs with equal instances of more common words.
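For instance, under this proposal a doc with ten hits on 'foo' in that 9-point structure scores about 600, while a doc with ten hits on 'the' in the same structure stays at 90, since the f_bias for 'the' is 1.0.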

What do you think?

pek

-- 
Peter Karman - Software Publications Engineer - Cray Inc
phone: 651-605-9009 - mailto:karman@cray.com




Received on Wed Apr 21 07:22:56 2004