MetaNamesRank (was: Multiple property values)

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Dec 02 2002 - 06:40:18 GMT
At 06:17 AM 11/28/02 -0800, William Bailey wrote:
>Also how is the MetaNamesRank directive going? This sounds interesting and 
>could well turn out to be a worthwhile feature.

Do you have an idea of how you would like to see this implemented?

Currently, ranking is rather simplistic.  I don't think that's a huge
problem, considering that swish-e is not often used for indexing tens of
millions of docs or more.  In fact, most tweaks to the ranking code I've
tried in testing don't make much difference in the overall ranking of results.

Currently, a file's rank (for a single word) is basically the log of word
frequency (how many times that word is found in a document) scaled to 1000.
The word frequency is biased by the "structure" which is a flag that says
where in the html the word is found (e.g. in <title> or <h1>).  For
example, a word in the title might be ranked as equal to, say, five
occurrences in the body.
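
To make the description above concrete, here is a minimal sketch in Python
(not swish-e's actual C code).  The weights in STRUCTURE_WEIGHTS, the
log(1 + f) form, and the max_weighted_freq cap are all assumptions chosen
for illustration; only the "title counts as about five body hits" figure
comes from the example above.

```python
import math

# Hypothetical weights: how many body occurrences one hit is worth.
STRUCTURE_WEIGHTS = {"body": 1, "em": 2, "h1": 3, "title": 5}


def word_rank(hits, max_weighted_freq=1000):
    """Rank one word in one document on a 0..1000 scale.

    hits: list of structure names, one entry per occurrence of the word.
    max_weighted_freq: assumed frequency at which the rank saturates.
    """
    # Bias the raw frequency by where each occurrence was found.
    weighted = sum(STRUCTURE_WEIGHTS.get(s, 1) for s in hits)
    if weighted <= 0:
        return 0
    # Log of the biased frequency, scaled so the cap maps to 1000.
    scale = 1000 / math.log(1 + max_weighted_freq)
    return min(1000, round(scale * math.log(1 + weighted)))
```

With these assumed weights, word_rank(["title"]) gives the same rank as
word_rank(["body"] * 5), which is the equivalence described above.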

I was thinking of doing a similar adjustment for metanames, although I have
not thought about how to combine a meta bias with the structure bias --
which can happen if you use "fake" html tags as meta tags in HTML2 docs:

   <metafoo><h1>word</h1></metafoo>
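
As a starting point for discussion, here is a sketch of two possible ways
to combine the two biases for a hit like the one above.  Nothing here is
implemented in swish-e: the bias tables, the function name, and both
combination modes are hypothetical.

```python
# Assumed bias values, for illustration only.
STRUCTURE_BIAS = {"body": 1.0, "h1": 3.0, "title": 5.0}
META_BIAS = {"metafoo": 2.0}


def combined_bias(metaname, structure, mode="multiply"):
    """Return one weight for a hit that has both a metaname and a structure."""
    s = STRUCTURE_BIAS.get(structure, 1.0)
    m = META_BIAS.get(metaname, 1.0)
    if mode == "multiply":
        # The biases compound: an <h1> inside <metafoo> counts 3 * 2 = 6.
        return s * m
    if mode == "max":
        # Only the stronger of the two biases applies.
        return max(s, m)
    raise ValueError("unknown mode: %r" % mode)
```

Multiplying lets nested tags dominate the ranking quickly, while taking
the maximum keeps the weights bounded by the largest configured bias.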

Anyway, this might be a good time to discuss the ranking code.  Comments
welcome!

Swish calculates the rank word-by-word.  That is, if you search for "foo
bar" then swish calculates the rank for "foo" and then for "bar" and then
combines the rank (running average for ANDs, and simple addition for ORs,
IIRC).
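
A rough sketch of that combination step in Python follows.  It takes
"running average" to mean folding the ranks pairwise from left to right,
which is only one reading of the description above; the function names
are made up for this example.

```python
from functools import reduce


def and_rank(ranks):
    """Combine per-word ranks for an AND query with a running average."""
    if not ranks:
        raise ValueError("need at least one rank")
    # Average the running result with each new word's rank in turn.
    return reduce(lambda acc, r: (acc + r) / 2, ranks)


def or_rank(ranks):
    """Combine per-word ranks for an OR query by simple addition."""
    return sum(ranks)
```

For "foo AND bar" with ranks 400 and 600 this gives 500; for "foo OR bar"
it gives 1000.  Note that with a pairwise running average the later words
in a long AND query carry more weight than the earlier ones.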

When calculating the rank for a single word swish knows the word
position(s) in the source file, the metaname that the word is in, and the
"structure" which, again, is a flag saying if it's in <title> or <em> or
<h1> and so on.

The word position is not absolute -- it gets modified (bumped) by some tags
(to prevent phrase matches across metanames, for example).  But it could
also be used to bias the rank.  For example, maybe words that are in the
first 100 words of the document should be ranked higher.
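
One way such a position bias might look, as a sketch only: the cutoff of
100 comes from the example above, while the boost factor of 2.0 and the
function names are assumptions.  The result would replace the raw word
count as input to the rank calculation.

```python
def position_bias(position, cutoff=100, boost=2.0):
    """Weight for a single hit based on its (bumped) word position."""
    return boost if position <= cutoff else 1.0


def biased_frequency(positions, cutoff=100, boost=2.0):
    """Sum the position weights over every occurrence of a word."""
    return sum(position_bias(p, cutoff, boost) for p in positions)
```

For a word found at positions 3, 50 and 250, biased_frequency([3, 50, 250])
counts the first two hits double.  Because positions get bumped by some
tags, the cutoff would not correspond to exactly 100 words of text.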

Like I've said before -- code & comments welcome.

-- 
Bill Moseley
mailto:moseley@hank.org
Received on Mon Dec 2 06:40:43 2002