Peter Karman wrote on 4/16/04 4:38 PM:
> 2. RelativeFrequencyBias *percent* *bias* *max*
>
> I know, this seems like newer new math (and my math was never
> outstanding; I'm a literary critic by training...). But consider this
> example and please tell me where my logic is wrong:
I must have been high when I wrote this last Friday. My example was just
totally wrong.
What I want to do in ranking is account for the relative frequency of
each query word in the total found set. Then apply something like audio
compression (not like mp3, but more like analog compression in
recording), where all the softer sounds are brought up to a threshold
and all the louder sounds are tapered off at a max, thereby reducing the
sonic range to within a min and max.
Example:
a search for 'the foo' turns up 100 hits. 'the' appears a total of 1000
times in those 100 hits. 'foo' appears a total of 150 times. My
assumption is that 'foo' is a more important word than 'the', based on
those numbers.
If we use this formula:
f_bias = max_freq / freq
then the f_bias for 'foo' would be:
6.67 = 1000 / 150
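
A minimal sketch of that calculation in C, to make the arithmetic concrete. The function names freq_bias and clamp_bias are hypothetical and do not exist in rank.c. The clamp is only one assumed reading of the "compression" idea above, and the neutral bias for a zero count is also an assumption.

```c
/* Hypothetical helper, not part of rank.c: relative-frequency bias for one
 * query word. freq is the word's total count across the found set, and
 * max_freq is the largest such count among all the query words. */
static double freq_bias(unsigned long freq, unsigned long max_freq)
{
    if (freq == 0)
        return 1.0;   /* assumption: a word with no hits gets a neutral bias */
    return (double)max_freq / (double)freq;
}

/* Assumed form of the "compression" step: keep the bias within [lo, hi],
 * so very rare words cannot swamp the ranking. */
static double clamp_bias(double bias, double lo, double hi)
{
    if (bias < lo)
        return lo;
    if (bias > hi)
        return hi;
    return bias;
}
```

With the numbers from the example, freq_bias(150, 1000) gives roughly 6.67 for 'foo' and freq_bias(1000, 1000) gives 1.0 for 'the'.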
In rank.c right now, each word's raw rank per doc is calculated based on
structure (context where it appears) and any MetaNamesRank value.
rank += sw->structure_map[ GET_STRUCTURE(posdata[i]) ] + meta_bias;
What I'm proposing is this:
rank += ( sw->structure_map[ GET_STRUCTURE(posdata[i]) ] + meta_bias )
* f_bias;
In my example, if a doc had 'foo' 10 times in a structure worth 9
points, for a normal rank of 90, its rank would jump to 600+ points (10
* 9 * 6.67). This makes sense to me, because in our example, this
particular doc has about 6.7% (10 of 150) of the total occurrences of
'foo' in it, making it a pretty 'relevant' doc.
This lets docs with less common words rise faster in the rankings than
docs with equal instances of more common words.
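
To show the effect on a single doc, here is a rough stand-in for the per-position loop. It is not the real rank.c code: biased_rank is a hypothetical name, and it assumes every occurrence of the word sits in a structure worth the same number of points.

```c
/* Hypothetical stand-in for the per-position loop in rank.c. Each
 * occurrence adds (structure points + meta bias), scaled by the word's
 * relative-frequency bias. */
static double biased_rank(int occurrences, int structure_points,
                          int meta_bias, double f_bias)
{
    double rank = 0.0;
    int i;

    for (i = 0; i < occurrences; i++)
        rank += (structure_points + meta_bias) * f_bias;

    return rank;
}
```

Calling biased_rank(10, 9, 0, 1.0) returns 90, the current rank, while biased_rank(10, 9, 0, 1000.0 / 150.0) returns about 600, so the same ten hits count for much more when the word is rarer across the found set.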
What do you think?
pek
--
Peter Karman - Software Publications Engineer - Cray Inc
phone: 651-605-9009 - mailto:karman@cray.com
Received on Mon Apr 19 05:17:43 2004