Hello Peter:
Q: Are you saying that in SWISH-E right now, each word
is weighted as equally important?
Also, I thought you might find this interesting. I found
it on the web today:
http://www.phpconference.de/2003/slides/database_track/golubchik_mysql_fulltext_search_2003.pdf
Page 16 discusses "relevance ranking" where the author
notes "there are many different theories proposing
estimation of the word weight in a document", and
shows one.
Thirdly, a metric I've used successfully outside of
swish-e is:
weight(word) =
  -log( num occurrences of word /
        (num of times most common word appeared * 1.3) );
It ain't pretty, but with my normal data it yields
word weights from 0.262364 for the most common
word (which appears 3648 times) to 8.464299 for words
that appear once.
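
To check those two endpoints: the figures only work out if the log is the natural logarithm, so that is assumed here. Writing c(w) for a word's count and c_max for the count of the most common word (both symbols are shorthand introduced for this check, not anything in swish-e):

```latex
\begin{align*}
\mathrm{weight}(w) &= -\ln\!\left(\frac{c(w)}{1.3\, c_{\max}}\right) \\
c(w) = c_{\max} = 3648:\quad \mathrm{weight}(w) &= -\ln\!\left(\frac{1}{1.3}\right) = \ln 1.3 \approx 0.262364 \\
c(w) = 1:\quad \mathrm{weight}(w) &= \ln(1.3 \times 3648) = \ln 4742.4 \approx 8.464299
\end{align*}
```

The 1.3 factor is what keeps the most common word's weight above zero; without it that word would score -ln(1) = 0.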
Best,
jrobinson
At 5:16 AM -0700 4/19/04, Peter Karman wrote:
Peter Karman wrote on 4/16/04 4:38 PM:
> 2. RelativeFrequencyBias *percent* *bias* *max*
>
> I know, this seems like newer new math (and my math was never
> outstanding; I'm a literary critic by training...). But consider this
> example and please tell me where my logic is wrong:
I must have been high when I wrote this last Friday. My example was
just totally wrong.
What I want to do in ranking is account for the relative frequency of
each query word in the total found set. Then apply something like audio
compression (not like mp3, but more like analog compression in
recording), where all the softer sounds are brought up to a threshold
and all the louder sounds are tapered off at a max, thereby reducing
the sonic range to within a min and max.
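
A minimal sketch of that compression idea, assuming it amounts to clamping a per-word weight into a fixed range. The function name and both bounds are hypothetical; nothing like this exists in swish-e today:

```c
#include <assert.h>

/* Hypothetical helper: squeeze a per-word weight into [min_w, max_w].
 * Weights below min_w are raised to it ("softer sounds brought up");
 * weights above max_w are capped ("louder sounds tapered off"). */
double compress_weight(double weight, double min_w, double max_w)
{
    assert(min_w <= max_w);
    if (weight < min_w)
        return min_w;
    if (weight > max_w)
        return max_w;
    return weight;
}
```

A hard clamp is only the simplest reading of the analogy; a real audio compressor scales the signal gradually above its threshold rather than cutting it off flat.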
Example:
A search for 'the foo' turns up 100 hits. 'the' appears a total of 1000
times in those 100 hits. 'foo' appears a total of 150 times. My
assumption is that 'foo' is a more important word than 'the', based on
those numbers.
If we use this formula:
f_bias = max_freq / freq
then the f_bias for 'foo' would be:
6.67 = 1000 / 150
In rank.c right now, each word's raw rank per doc is calculated based on
structure (context where it appears) and any MetaNamesRank value.
rank += sw->structure_map[ GET_STRUCTURE(posdata[i]) ] + meta_bias;
What I'm proposing is this:
rank += ( sw->structure_map[ GET_STRUCTURE(posdata[i]) ] + meta_bias )
        * f_bias;
In my example, if a doc had 'foo' 10 times in a structure worth 9
points, for a normal rank of 90, its rank would jump to 600+ points
(10 * 9 * 6.67). This makes sense to me, because in our example, this
particular doc has about 6.7% (10 of 150) of the total occurrences of
'foo' in it, against an average of 1.5 per hit, making it a pretty
'relevant' doc.
This lets docs with less common words rise faster in the rankings than
docs with equal instances of more common words.
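
The proposal can be sketched as standalone C, assuming the per-word totals for the found set are already available. Both function names and the plain int array of structure weights are hypothetical stand-ins for what rank.c gets from sw->structure_map and posdata:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: f_bias = max_freq / freq, where both counts are
 * totals across the whole found set. The most common query word gets 1.0. */
double f_bias(unsigned long max_freq, unsigned long freq)
{
    assert(freq > 0 && max_freq >= freq);
    return (double)max_freq / (double)freq;
}

/* Hypothetical stand-in for the rank.c loop: each occurrence contributes
 * its structure weight plus meta_bias, scaled by the word's bias. */
double biased_rank(const int *structure_weights, size_t n,
                   int meta_bias, double bias)
{
    double rank = 0.0;
    for (size_t i = 0; i < n; i++)
        rank += (structure_weights[i] + meta_bias) * bias;
    return rank;
}
```

With the numbers from the example, f_bias(1000, 150) is about 6.67, and ten occurrences at a structure weight of 9 with no meta bias come out at roughly 600 instead of 90.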
What do you think?
pek
--
Peter Karman - Software Publications Engineer - Cray Inc
phone: 651-605-9009 - mailto:karman@cray.com
Received on Wed Apr 21 07:22:56 2004