At 02:03 PM 02/27/02 -0800, Alex Lyons wrote:
>If I may contribute to the debate...
>Bill suggests something like "-w *=foo" meaning search in all
>metanames, but also suggests "-w swishdefault,keywords,othermeta=foo" as
>meaning search in only the specified metas. At the moment it seems that
>the simple "-w foo" would map to the proposed "-w swishdefault=foo".
That's the way it currently works. swishdefault is always metaID number
one, and is what is used if no metaname is specified in the query. In the CGI
scripts I write where the user can select a metaname to search, I default to
"swishdefault" and always build the query as -w $metaname=($query).
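A minimal sketch of that query-building pattern, written in Python rather than whatever language the CGI scripts actually use. The helper name build_query is hypothetical; only the swishdefault fallback and the metaname=(query) form come from the description above.

```python
DEFAULT_METANAME = "swishdefault"  # used when the user selects no metaname


def build_query(query, metaname=None):
    """Return the string to pass after -w (hypothetical helper)."""
    name = metaname or DEFAULT_METANAME
    # Parenthesise the user's query so the metaname applies to all of it.
    return f"{name}=({query})"


print(build_query("foo"))
print(build_query("foo or bar", "title"))
```

Always emitting an explicit metaname keeps the script on a single code path, whether or not the user picked one.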
>If this proposal is implemented, could I suggest that, instead, "-w foo"
>should map to "-w *=foo", which seems to be what Fred was originally
That's a big change to current behavior, though. It makes sense, but it
seems like too big a change. Could be a config.h setting, I suppose.
>If all this is done, we don't really need the "-t HBthec" flags any
>more, as "-t t -w foo" can be expressed more generally as "-w title=foo"
>and so on (the h and e flags would correspond to metaname aliases or
>groups). No doubt the -t would have to remain for a while for backwards
I wonder how many people use the "-t HBthec" flags. The flag that tracks
the "structure" of the word (title, em, h1-h3..) is also used for ranking,
so the -t feature comes along for free, really.
>Also, it could be useful to be able to assign a "weight" to words found
>in particular meta tags, like the RANK_* defines in config.h. I guess
>this could be most easily done at index time through the config file: I
>wouldn't know whether the flexibility of being able to do it at search
>time (e.g: "-w title(3.0),swishdefault(1.0)=foo") would be required or
Yes, the plan is to be able to set the ranking values for both metanames
and for the "structure" in the config file. The idea is to off-load
complicated decisions to the end-user to mess up instead. ;)
I wish I had more time (and were a smarter programmer). I've been using a
few minutes here and there to look over the ranking code lately. Rank is
currently calculated at search time, although IIRC all the data used to
calculate a word's rank is available at indexing time. So it would
make more sense to calculate a word's rank at indexing time (to make
But that would prevent changing the ranking at search time. My guess is
that's a rather advanced query feature, and I'm not sure how useful that
would be for the end-user.
Swish-e's ranking is not very fancy. But it may not need to be for the
typically small collection of files that swish deals with. There's quite a
bit of "info" on the Internet about document ranking, as you might expect.
Great dissertation fodder. I wonder if applying document vectoring or
other more advanced techniques would make any difference in real life.
What we have to work with is how many times a word is found in the
document, the "structure" of each of those words (bold, title, heading),
the position of the word, the metaID, and the total length of the document.
Now, give me an equation to put all that together.
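Purely as a talking point, here is one hypothetical way those inputs could be combined. This is not swish-e's ranking code: the weights, the position bonus, and the log length damping are all assumptions made up for illustration.

```python
import math

# Hypothetical weights; these are NOT swish-e's RANK_* values.
STRUCTURE_WEIGHT = {"body": 1.0, "em": 1.5, "heading": 3.0, "title": 5.0}
META_WEIGHT = {1: 1.0}  # metaID -> weight; metaID 1 is swishdefault


def word_rank(occurrences, doc_length):
    """Rank one word in one document.

    occurrences: list of (structure, position, meta_id) tuples,
                 with 0 <= position < doc_length.
    doc_length:  total number of words in the document.
    """
    if doc_length <= 0 or not occurrences:
        return 0.0
    total = 0.0
    for structure, position, meta_id in occurrences:
        weight = STRUCTURE_WEIGHT.get(structure, 1.0)
        weight *= META_WEIGHT.get(meta_id, 1.0)
        # Assumption: words nearer the start count up to 50% more.
        weight *= 1.0 + 0.5 * (1.0 - position / doc_length)
        total += weight
    # Assumption: damp by length so long documents do not win on size alone.
    return total / math.log(1.0 + doc_length)
```

Every occurrence adds to the total, so frequency is counted implicitly; whether any of these factors helps in real life is exactly the open question.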
BTW -- swish-e's current basic ranking has some fun side-effects. Say you
want to find "foo or bar" but you want to favor docs that have "foo". You
can search with
-w (foo or foo or foo) or bar
because swish simply adds up the total ranks for each of the words.
So you can actually rank something like keywords higher by:
-w keywords=(foo or foo) or description=foo
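A toy model of why repeating a term works, assuming only what is said above: the ranks of OR-ed words are added up, repeats included. The function and the per-word rank values are hypothetical and are not taken from swish-e's source.

```python
def total_rank(terms, word_ranks):
    """Sum per-word ranks for an OR-ed list of terms, repeats included."""
    return sum(word_ranks.get(term, 0) for term in terms)


# Hypothetical per-word ranks in two documents.
doc_a = {"foo": 10}  # contains only "foo"
doc_b = {"bar": 10}  # contains only "bar"

plain = ["foo", "bar"]                  # -w foo or bar
boosted = ["foo", "foo", "foo", "bar"]  # -w (foo or foo or foo) or bar

print(total_rank(plain, doc_a), total_rank(plain, doc_b))      # 10 10
print(total_rank(boosted, doc_a), total_rank(boosted, doc_b))  # 30 10
```

With the plain query the two documents tie; with the repeated term the "foo" document comes out ahead.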
>Finally (and now totally off-subject), would it be possible to include
>some algorithm to scale the rank by some factor based on another
>(numerical) property? I'm thinking in particular of the last-modified
>property, so that newer files can be given a higher rank than older
>ones, where at the moment the search might give them equal rank. Might
>also be useful for the size property to give smaller files a greater
>rank. Some sort of "exponential decay" term (e.g: rank *=
>exp(-age/age0)*exp(-size/size0)) where age0 and size0 are specified
>either in the config or maybe even in the arglist at search-time. I
>know the CGI script can sort by numeric property once it has all the
>results, but if swish has already sorted results by rank, why do it
Seems like a good idea. And if not using the last modified date, perhaps:
AdjustRankProperty cash_paid 4.0
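For reference, the exponential-decay scaling quoted above could be sketched like this. The function name, the units (days and bytes), and the sample values are assumptions; swish-e has no such feature as far as this thread says.

```python
import math


def scaled_rank(rank, age, size, age0, size0):
    """Apply rank * exp(-age/age0) * exp(-size/size0) from the quoted formula.

    age and age0 share one unit (days here); size and size0 share another
    (bytes here). age0 and size0 set how fast the rank decays.
    """
    return rank * math.exp(-age / age0) * math.exp(-size / size0)


# Two files with equal raw rank and size; the newer one ends up higher.
newer = scaled_rank(1000, age=1, size=2000, age0=30, size0=10000)
older = scaled_rank(1000, age=90, size=2000, age0=30, size0=10000)
print(newer > older)  # True
```

A file with age 0 and size 0 keeps its raw rank, so the scaling can only lower a rank, never raise it.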
Received on Wed Feb 27 23:09:20 2002