On Mon, Mar 08, 2004 at 06:12:53PM -0800, Dave Moreau wrote:
> My understanding based on results is that swish-e does not discriminate
> between words. Word frequency in a document is used to compute rank, but the
> word's frequency in the overall document set is not considered.
Right. Rank is based only on the frequency of the word in a given document.
> I just remember being taught that the weight of a word in the rank
> should be inversely proportional to the number of documents it appears
> in. This would cause the word 'the' to be of less weight than the word
> 'democracy', even if (in most document sets) 'the' appears in the
> title and 'democracy' only in the body.
All swish has is the IgnoreTotalWordCountWhenRanking option that seems
to bias the rank based on how many words are in the document. That code
pre-dates my involvement in swish so I'm not sure of the reasoning or
implementation (but I did include it in the last rewrite of rank.c).
It's off by default and, as I mentioned in my last message, in my tests it
seems to make ranking worse.
It's been a while since I looked at how the word data is stored (Jose
knows better), but in addentry() in index.c I would think you could
store the total word count for all documents (within a given metaname)
and use that to weight the words.
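To make the weighting idea concrete, here is a minimal sketch in Python (not swish-e's C, and not taken from rank.c or index.c). It assumes the usual inverse-document-frequency form, log(N / df), where df is the number of documents a word appears in; the function names idf_weights() and rank() and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Map each word to log(N / df), df = number of documents containing it."""
    n_docs = len(docs)
    doc_freq = Counter()
    for words in docs:
        # Count each word once per document, however often it occurs there.
        doc_freq.update(set(words))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def rank(query_words, doc_words, weights):
    """Sum of (in-document frequency * collection-wide weight) per query word."""
    term_freq = Counter(doc_words)
    return sum(term_freq[w] * weights.get(w, 0.0) for w in query_words)


# Hypothetical three-document collection.
docs = [
    "the history of democracy".split(),
    "the cat sat on the mat".split(),
    "the weather report".split(),
]
weights = idf_weights(docs)

# 'the' is in every document, so its weight is log(3/3) = 0;
# 'democracy' is in one document, so its weight is log(3).
score_first = rank(["the", "democracy"], docs[0], weights)
score_second = rank(["the", "democracy"], docs[1], weights)
```

With these weights the first document outranks the second for the query "the democracy", even though "the" occurs twice in the second document.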
Still, getting a final rank is complex -- you have phrases, word
positions (should words early in a document weigh more?), metanames,
word proximity (should rank on multi-word queries be adjusted based on
how close together the words are?) and structure (bold, headings, title). I
still think swish-e's rank would make a good graduate project for
> Was discriminating among terms considered for swish-e and judged to be
> too much additional work, or was it not included cuz it's a bad idea?
> Or did the issue never come up?
It comes up a lot.
> It seems it would give more relevant results.
In some cases, probably yes.
Received on Tue Mar 9 06:16:44 2004