Re: MetaNamesRank

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Mar 18 2002 - 23:00:29 GMT
At 02:20 PM 03/18/02 -0800, Bob Stewart wrote:
>At 01:45 PM 3/18/02 -0800, you wrote:
>#define RANK_TITLE		4
>#define RANK_HEADER		3
>#define RANK_META		3

>
>I'm not sure I understand how they work. 
>Does "RANK_HEADER" refer to anything that appears within the <HEAD> tags?

No, a quick trip with grep leads me to believe that RANK_HEADER (which is
applied when the IN_HEADER flag is set) is for the <Hn> tags, whereas
IN_HEAD is <head>.

>And RANK_META anything that appears in any meta tag, or only ones that are
>defined in the config file with the "MetaNames" line?

Maybe.

I have to figure this stuff out each time.  

It kind of depends on which HTML parser is used.

> cat foo
<meta name="foo" content="hi">

> ./swish-e -i foo -T indexed_words -v0
Indexing Data Source: "File-System"
    Adding:[1:swishdefault(1)]   'hi'   Pos:2  Stuct:0x81 ( META FILE )
Indexing done!

So yes, that's marked as IN_META.  Now, let's use the libxml2 parser:

> cat c
DefaultContents HTML2

> ./swish-e -i foo -T indexed_words -v0 -c c
Indexing Data Source: "File-System"
    Adding:[1:swishdefault(1)]   'hi'   Pos:3  Stuct:0x5 ( HEAD FILE )
Indexing done!

Ok, that changed two things.  First, libxml2 is smart enough to place the
<meta> tag into an implicit <head> section.  But it also sees that "foo" is
not a defined metaname so it's not setting the META flag.  That's by
design, but I'm not sure if that's also a bug. ;)  I don't think it's
documented one way or another.  Suggestions?

> cat c
DefaultContents HTML2
metanames foo

> ./swish-e -i foo -T indexed_words -v0 -c c
Indexing Data Source: "File-System"
    Adding:[1:foo(10)]   'hi'   Pos:2  Stuct:0x85 ( META HEAD FILE )
Indexing done!

Yep, adding "foo" as a metaname changes that.
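
As a rough sketch of how to read the structure values in the three runs
above: the only bit values that can be inferred from that output are FILE
(0x01), HEAD (0x04) and META (0x80), since 0x81 printed as "META FILE",
0x5 as "HEAD FILE" and 0x85 as "META HEAD FILE".  The SKETCH_* names and
the helper function below are hypothetical and are not taken from the
swish-e source; flags that do not appear in the output are left out.

```c
#include <stddef.h>
#include <string.h>

/* Bit values inferred from the -T indexed_words output only.
 * These names are illustrative, not the swish-e definitions. */
#define SKETCH_IN_FILE 0x01
#define SKETCH_IN_HEAD 0x04
#define SKETCH_IN_META 0x80

/* Write the names of the flags set in `structure` into buf, highest bit
 * first (the order seen in the output above).  Returns buf. */
static char *sketch_decode_structure(unsigned structure, char *buf, size_t len)
{
    static const struct { unsigned bit; const char *name; } flags[] = {
        { SKETCH_IN_META, "META" },
        { SKETCH_IN_HEAD, "HEAD" },
        { SKETCH_IN_FILE, "FILE" },
    };
    size_t i;

    if (len == 0)
        return buf;
    buf[0] = '\0';
    for (i = 0; i < sizeof flags / sizeof flags[0]; i++) {
        if (structure & flags[i].bit) {
            if (buf[0] != '\0')
                strncat(buf, " ", len - strlen(buf) - 1);
            strncat(buf, flags[i].name, len - strlen(buf) - 1);
        }
    }
    return buf;
}
```

Called with a 64-byte buffer, sketch_decode_structure(0x85, buf, sizeof buf)
fills buf with "META HEAD FILE", matching the last run.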


>So a word within the <Title> tag would have:
>1 (base rank) + 4 (within the title tag) + 3  (within the head tag) = 8

Something like that.  ( A few printf's in rank.c would be the way to go. )

Again, this has been discussed.  Not too long ago, the "structure" (which
holds the IN_TITLE, IN_META, IN_HEAD,... flags) was a *single* bitmask used
for *all* occurrences of a word in a document.  So if "foo" was IN_TITLE,
but also twenty times in the body, then ALL 21 were considered IN_TITLE.
That results in that word having a much higher rank than it really
deserves.  In most cases that overstated rank for that word is probably
just fine, but it's still not accurate.

Jose, not too long ago, modified the index structure to keep the
"structure" bitmask per word position.  That increased the size of the
index by a small amount (the data is compressed), but it means that a more
accurate rank can be calculated.

At this time, though, that extra data is not used, and rank.c just combines
(OR's) the bitmasks for all the word positions for a given word together
and calculates the rank like it did before.
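
To make the difference concrete, here is a minimal sketch comparing the
two approaches.  It is not the code in rank.c: the scoring rule (a base of
1 plus a bonus per flag, summed over occurrences) is an assumption made
for illustration, the SKETCH_IN_TITLE and SKETCH_IN_HEADER bit values are
placeholders, and all function names are hypothetical.  Only the RANK_*
weights come from the defines quoted at the top of this message.

```c
#include <stddef.h>

/* Weights quoted earlier in the thread. */
#define RANK_TITLE  4
#define RANK_HEADER 3
#define RANK_META   3

/* Placeholder bit values for illustration only. */
#define SKETCH_IN_TITLE  0x02
#define SKETCH_IN_HEADER 0x20
#define SKETCH_IN_META   0x80

/* Assumed scoring rule: base weight of 1 plus a bonus per flag set. */
static int sketch_weight(unsigned structure)
{
    int w = 1;
    if (structure & SKETCH_IN_TITLE)  w += RANK_TITLE;
    if (structure & SKETCH_IN_HEADER) w += RANK_HEADER;
    if (structure & SKETCH_IN_META)   w += RANK_META;
    return w;
}

/* Combined approach: OR every position's bitmask together, then give
 * every occurrence the weight of the combined mask. */
static int sketch_rank_combined(const unsigned *pos, size_t n)
{
    unsigned all = 0;
    size_t i;
    for (i = 0; i < n; i++)
        all |= pos[i];
    return (int)n * sketch_weight(all);
}

/* Per-position approach: weight each occurrence by its own bitmask. */
static int sketch_rank_per_position(const unsigned *pos, size_t n)
{
    int total = 0;
    size_t i;
    for (i = 0; i < n; i++)
        total += sketch_weight(pos[i]);
    return total;
}
```

With one occurrence in the title and twenty in the body, the combined
version scores 21 * 5 = 105 under these assumed weights, while the
per-position version scores 5 + 20 = 25.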

My opinion is that the word's rank can be calculated during indexing
instead of during searching and then that extra structure bitmask won't
need to be stored in the index.  (You can look at rank.c, which is called
while searching, and see that it's working with data that's known while
indexing.)

Of course, I also think that ranking should be overhauled.  I think that
would be a great project for someone (perhaps a grad student?).

But, once again, I seem to have written more than I intended.


-- 
Bill Moseley
mailto:moseley@hank.org
Received on Mon Mar 18 23:01:58 2002