At 08:52 AM 08/30/02 -0700, David VanHook wrote:
>The folks on the business side of things are asking me for an explanation of
>how the "rank" is generated for each query. I've looked at the rank.c code,
>but unfortunately my C skills are not sufficient for me to make much
I think the rank.c code will make sense once you understand the data. I
think I can figure out how it works, but not why it works like it does ;)
It's not a complicated system -- swish-e isn't indexing Googles of docs
That said, anyone wanting to improve it is MORE than welcome to help.
Basically, the rank of a single hit word is the log base e of the number of
times the word is found in the doc (it's frequency). I suppose log(e) is
to avoid making docs with a huge number of words rank way higher than those
with just a few.
Then the rank is multiplied by a factor, depending on where in the document
the word is found (e.g. in <title> or <h1>).
Each word in the index has data that describes the word's position (which
is used for phrase matching) and also a byte called the word's "structure"
which says where in a HTML document the word was found. This is used to
bias a words rank value.
In config.h is a set of defines that set the relative values of the words:
#define RANK_TITLE 4 // <title>
#define RANK_HEADER 3 // <h1>..<h*>
#define RANK_META 3 // <meta>
#define RANK_COMMENTS 1 // <!-- comment -->
#define RANK_EMPHASIZED 0 // <em> <strong> <b> <i>
In the code you can see that all words start with a "factor" value of one,
but if a word is, say, in a title, then the value of four is added and the
factor becomes five.
So the rank for a given word is just
rank = factor * log(e) word_count_in_doc
Now, if you set IgnoreTotalWordCountWhenRanking to "no" then the rank is
adjusted slightly based on the total number of words in the file -- I guess
the thinking goes that if foo is in a document of 20 words it should rank
higher than in a doc of 1000 words. Check the archives as there had been
discussion about that feature. It's off by default.
One problem with the above is how the factor is calculated. Originally
there was *one* "structure" bitmask for the list of words in a document.
So, if "foo" was in the title, and also in the body five times, then it
really looked like "foo" was in the title five times.
Jose changed the index format to store the structure per word position to
fix that problem, but now the problem is in rank.c in that the structure
bits for all the word positions are merged together to make the factor.
Need to decide how to rank the words individually and then combine them
into a single rank for that word hit. I suppose what should be done is
scale up the frequency based on where the word is located and not use the
"factor" at all. That is, a title word should be considered as, say, five
or ten body words.
You can see how low-tech this system is. It would be rotten if you are
indexing tens of millions of documents, but for thousands or even tens of
thousands it's probably ok. If "foo" is listed in a document many times is
the document really about "foo"?
Another thing to note about all this is that each word's rank (as it is
now) could be calculated during indexing instead of during searches. This
could speed up searching slightly and maybe reduce the index size a bit.
My guess, though, is that to improve ranking at some time in the future
that we would want all that data during searching.
A running average is used to calculate a document's rank for words that are
ANDed together (see search.c), and ORed results are have their ranks added
together and multiplied by two with the idea that if you say "this OR that"
that words with "this AND that" should rank higher.
Now, another problem is that phrase searches don't really rank correctly.
Hopefully phrase searches result in small result sets where ranking is less
critical, but if you search for "this phrase" it's still ranking as two
word hits where words that are not in the phrase count -- that is if two
documents have "this phrase" each once, but one document has "phrase" in
the title or many times in the body then its rank will be higher. That's
because ranking is based on individual word hits, without knowledge of if
that word is in a phrase or not.
I have spent some time trying different things with the ranking code and
for small sets of files (typical for swish) I have not see huge
differences. I would love to find someone to rewrite the ranking code --
great graduate student project.
Received on Fri Aug 30 19:36:47 2002