Re: User ranking tags.

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Apr 21 2003 - 21:23:49 GMT
On Mon, 21 Apr 2003, Douglas Smith wrote:

> I hope this isn't just noise on this mailing list...  I should
> start going after the code and see if there could be some way to
> put this in.  I mean you have to get info about the page out
> of the index anyway, not just the properties, at least the
> url or filename.  When this is retrieved couldn't a rank factor
> be retrieved also?

The url/filename is just another property.  

This would likely require another table indexed by file number.  To scale
all the results you would need to look up the file's rank bias for each
result, so you really need the entire table in memory to be fast.  Using
the property table for this would be slow; better to have a table with
one byte per file.
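
As a rough sketch of what such a table could look like (nothing like
this exists in swish-e as far as this message says; the names
RankBiasTable, bias_table_init and apply_file_bias are made up, and
treating the byte as a percentage adjustment is just one possible
choice):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-file rank bias table: one signed byte per file. */
typedef struct {
    signed char *bias;   /* bias[filenum - 1], range -128..127 */
    size_t       count;  /* number of files in the index */
} RankBiasTable;

/* Allocate a zero-filled table (zero bias leaves the rank unchanged).
 * Returns 0 on success, -1 if the allocation fails. */
int bias_table_init(RankBiasTable *t, size_t file_count)
{
    assert(t != NULL);
    t->bias = calloc(file_count ? file_count : 1, 1);
    t->count = t->bias ? file_count : 0;
    return t->bias ? 0 : -1;
}

void bias_table_free(RankBiasTable *t)
{
    assert(t != NULL);
    free(t->bias);
    t->bias = NULL;
    t->count = 0;
}

/* Assumption: the bias is a percentage, so +50 turns a rank of 200
 * into 300.  File numbers are 1-based; unknown files are untouched. */
int apply_file_bias(const RankBiasTable *t, size_t filenum, int rank)
{
    long adjusted;

    assert(t != NULL);
    if (filenum == 0 || filenum > t->count)
        return rank;
    adjusted = (long)rank * (100 + t->bias[filenum - 1]) / 100;
    return adjusted < 1 ? 1 : (int)adjusted;
}
```

The whole table is one malloc'd block, so the per-result lookup is a
single array index with no disk access.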

How would people flag documents as being more important than others?  

I'd be worried that just marking a document as important could make
searching worse.  For example pages that are marked important but just
mention some other topic might rank higher than pages that are about the
topic you are searching.

What I think is better is the old <meta name="keywords"> method.  The
page's author tells swish that hits in the keywords meta tag should rank
very high.  That's what the RankBias setting is for.

Here's a little overview of the current ranking:

Raw word ranks are assembled (I don't want to say calculated) in rank.c.
In search.c they are combined when ANDing or ORing results.
result_sort.c runs through all the results, so while it's doing that it
finds the largest rank number "bigrank" and calculates a scaling factor
for use when displaying the rank in docprop.c.
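
The scaling step could be sketched roughly like this.  This is not the
code from result_sort.c or docprop.c; the function names and the
max_display parameter (the value the top hit should show, e.g. 1000)
are assumptions for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Walk the results once, track the largest raw rank ("bigrank"), and
 * return a factor that maps bigrank onto max_display. */
double rank_scale_factor(const int *raw_ranks, size_t n, int max_display)
{
    int bigrank = 0;
    size_t i;

    assert(raw_ranks != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (raw_ranks[i] > bigrank)
            bigrank = raw_ranks[i];

    /* With no positive ranks there is nothing to scale. */
    return bigrank > 0 ? (double)max_display / bigrank : 1.0;
}

/* Applied at display time; never report a rank below 1. */
int display_rank(int raw_rank, double scale)
{
    int r = (int)(raw_rank * scale);
    return r < 1 ? 1 : r;
}
```

With raw ranks of 40, 10 and 20 and max_display set to 1000, the factor
is 25, so the results display as 1000, 250 and 500.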

Right now there are a few variables that are combined to form the raw
rank.  I'm not at all clear how well they are combined.

The ranking system is very primitive and could use a redesign.

Basically, a word's rank in a document is the log() of the number of
times the word is found in the document.  Actually, it's a weighted
count: the structure each occurrence is found in affects the word's
rank.  In config.h I currently have on my machine:

#define RANK_TITLE      7
#define RANK_HEADER     5
#define RANK_META       3
#define RANK_COMMENTS   1
#define RANK_EMPHASIZED 0

So a normal word has a point value of one, but a word in the title has a
value of seven (plus one).  In addition to that, the meta name rank bias
is added to (or subtracted from) each of those.

Here's the basic code:

    for (i = 0; i < freq; i++)
        rank += sw->structure_map[ GET_STRUCTURE(posdata[i]) ] + meta_bias;

where the structure map is just a mapping of the above #defines, e.g.
foo's word value in <h1><em>foo</em></h1> is

  RANK_HEADER + RANK_EMPHASIZED + 1 + meta_bias

(all words have a starting value of one).

Again, the log() is taken of the total rank, so that a large word
frequency doesn't have such an effect.  That still doesn't work well when
there are some very large documents indexed, so on my machine I've just
limited "freq" to a small number (like 100) so only the first 100
occurrences of a word are considered.  That actually helped quite a bit.
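
A sketch of the frequency cap on its own (the log() step is left out
here).  MAX_RANK_FREQ and capped_raw_rank are made-up names, and
word_values[] stands in for the per-occurrence structure map lookups:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RANK_FREQ 100   /* hypothetical name for the cap */

/* Sum at most the first MAX_RANK_FREQ occurrences of a word, so a
 * very large document cannot win on repetition alone. */
int capped_raw_rank(const int *word_values, int freq, int meta_bias)
{
    int limit = freq < MAX_RANK_FREQ ? freq : MAX_RANK_FREQ;
    int rank = 0;
    int i;

    assert(word_values != NULL || freq <= 0);
    for (i = 0; i < limit; i++)
        rank += word_values[i] + meta_bias;
    return rank;
}
```

A word that shows up 250 times as plain text then contributes 100
points instead of 250 before the log() is taken.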

There's code that is supposed to take into consideration the size of the
document (IgnoreTotalWordsWhenRanking), but I have not found that it
helps.

I do think that trying to adjust ranking in multi-word searches by how
close together the words are would be a good thing.





-- 
Bill Moseley moseley@hank.org
Received on Mon Apr 21 21:24:36 2003