Q: Are you saying that in SWISH-E right now, each word
is weighted as equally important?
Also, thought you might find this interesting, I found
it on the web today:
Page 16 discusses "relevance ranking" where the author
notes "there are many different theories proposing
estimation of the word weight in a document", and
Thirdly, a metric I've used successfully outside of
-log( num occurrences of word / (num of times most
common word appeared * 1.3) );
It ain't pretty, but with my normal data it yields
word weights between 0.262364 for the most common
word (which appears 3648 times) and 8.464299 for words
that appear once.
At 5:16 AM -0700 4/19/04, Peter Karman wrote:
Peter Karman wrote on 4/16/04 4:38 PM:
> 2. RelativeFrequencyBias *percent* *bias* *max*
> I know, this seems like newer new math (and my math
> outstanding; I'm a literary critic by training...).
> But consider this example and please tell me where my logic is wrong:
I must have been high when I wrote this last Friday.
My example was just
What I want to do in ranking is account for the relative frequency of
each query word in the total found set. Then apply something like audio
compression (not like mp3, but more like analog recording), where all the
softer sounds are brought up to a threshold and all the louder sounds are
tapered off at a max, thereby reducing the sonic range to within a min
and max.
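The crudest version of that compression idea is a hard clamp, sketched below in C. This is only an illustration: the name compress_bias and its floor/ceiling parameters are hypothetical, and a hard clamp does not model the gradual tapering an analog compressor would give.

```c
/*
 * Hypothetical helper, not SWISH-E code.
 * Squeeze a raw bias value into the range [floor_val, ceiling_val]:
 * anything quieter than the floor is raised to it, anything louder
 * than the ceiling is cut down to it. Assumes floor_val <= ceiling_val.
 */
double compress_bias(double raw, double floor_val, double ceiling_val)
{
    if (raw < floor_val)
        return floor_val;
    if (raw > ceiling_val)
        return ceiling_val;
    return raw;
}
```

For instance, with a floor of 1 and a ceiling of 5, a raw bias of 0.5 comes out as 1, and a raw bias of 6.67 comes out as 5.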
A search for 'the foo' turns up 100 hits. 'the' appears a total of 1000
times in those 100 hits. 'foo' appears a total of 150 times. My
assumption is that 'foo' is a more important word than 'the', based on
If we use this formula:
f_bias = max_freq / freq
then the f_bias for 'foo' would be:
6.67 = 1000 / 150
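As a quick check of that arithmetic, the formula can be sketched in C as follows. The function name relative_freq_bias is hypothetical, and returning 0 for a word that never appears is my own assumption, not something the proposal specifies.

```c
/*
 * Hypothetical helper, not SWISH-E code.
 * f_bias = max_freq / freq
 * max_freq: total occurrences of the most frequent query word
 *           in the found set
 * freq:     total occurrences of this query word in the found set
 */
double relative_freq_bias(unsigned long max_freq, unsigned long freq)
{
    if (freq == 0)
        return 0.0;   /* assumption: an absent word gets no bias */
    return (double)max_freq / (double)freq;
}
```

With the numbers above, 'foo' gets 1000 / 150, or about 6.67, while 'the' gets 1000 / 1000 = 1.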
In rank.c right now, each word's raw rank per doc is calculated based on
structure (context where it appears) and any
rank += sw->structure_map[ GET_STRUCTURE(posdata[i]) ]
What I'm proposing is this:
rank += ( sw->structure_map[ GET_STRUCTURE(posdata[i]) ] + meta_bias )
In my example, if a doc had 'foo' 10 times in a structure worth 9
points, for a normal rank of 90, its rank would jump by 600+ points
(10 * 9 * 6.67). This makes sense to me, because in our
particular doc has 15% of the total occurrences of 'foo' in it, making it
a pretty 'relevant' doc.
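The worked arithmetic (10 * 9 * 6.67) suggests that each occurrence's score is multiplied by f_bias. Assuming that reading, here is a standalone C sketch of the loop. It is not the real rank.c: biased_rank is a hypothetical function, and a single structure_score stands in for the per-position sw->structure_map[ GET_STRUCTURE(posdata[i]) ] lookup.

```c
/*
 * Hypothetical sketch, not the actual rank.c code.
 * occurrences:     how many times the word appears in the doc
 * structure_score: points for the structure the word appears in
 *                  (assumed identical for every occurrence here)
 * meta_bias:       any additional per-occurrence bias
 * f_bias:          max_freq / freq for this word
 */
double biased_rank(int occurrences, int structure_score,
                   int meta_bias, double f_bias)
{
    double rank = 0.0;
    for (int i = 0; i < occurrences; i++)
        rank += (structure_score + meta_bias) * f_bias;
    return rank;
}
```

Calling biased_rank(10, 9, 0, 1000.0 / 150.0) gives about 600, against 90 when f_bias is 1.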
This lets docs with less common words rise faster in the rankings than
docs with equal instances of more common words.
What do you think?
Peter Karman - Software Publications Engineer - Cray
phone: 651-605-9009 - mailto:firstname.lastname@example.org
Received on Wed Apr 21 07:22:56 2004