Bill Moseley wrote on 8/4/04 1:03 PM:
> On Wed, Aug 04, 2004 at 04:08:09AM -0700, Peter Karman wrote:
>
>>For example, if the word 'the' appears in 98% of the docs in your index,
>>it will have an IDF of 1. If the word 'foo' appears in 10% of your docs,
>>it will have an IDF of something greater than 1 (something like 5 or 6,
>>depending on the math, number of docs, etc.). So for a query of 'the
>>foo', docs with more instances of 'foo' will rank relatively higher than
>>docs with fewer instances of 'foo', while instances of 'the' will affect
>>ranking much the same way they do now (that is to say, not much).
>
>
> Does this affect this config option?
>
> IgnoreTotalWordCountWhenRanking
>
Not for this IDF feature. Ideally, total word count would be used to
calculate word density and to normalize a document for length (really,
number of words). So IgnoreTotalWordCountWhenRanking would need to be
set to 0 for that to work (or word count stored in the index no matter
what but ignored for ranking -- it seems not to be stored in the index
at all if IgnoreTotalWordCountWhenRanking is set to 1). I've been
fooling with that but haven't had time to really test the results.
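
To make the word-density idea concrete, here is a rough sketch in Python (for illustration only; this is not swish code, and the function name and the simple hits/total formula are assumptions of mine). It only shows why the total word count has to be available at ranking time.

```python
def word_density(term_hits: int, total_words: int) -> float:
    """Fraction of a document's words that match the query term.

    Hypothetical helper: dividing by the total word count is one simple
    way to normalize for document length.
    """
    if total_words <= 0:
        return 0.0
    return term_hits / total_words


# Five hits in a 100-word doc count for more than five hits in a
# 10,000-word doc once length is taken into account.
short_doc = word_density(5, 100)
long_doc = word_density(5, 10_000)
```

Without a stored word count there is nothing to divide by, which is why the normalization depends on how IgnoreTotalWordCountWhenRanking is set.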
>
>>IDF has a similar effect to IgnoreLimit or StopWords, but on a smoother
>>scale. A word isn't just in or out (a StopWord or not), but rather has a
>>relative weight compared to all the other words in the index.
>
>
> But, that's not implemented, right? So is the idea that stopwords
> just have a much lower score?
Sorry, I don't follow. What's not implemented? The 'smooth' effect
should be felt with the new IDF feature.
But yes, you're right about the idea: stopwords are still counted but
just have a much lower score. That allows you to still find exact
phrases like "the foo" as opposed to "a foo" but rank/weight is adjusted
per word.
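
A small sketch of that per-word weighting, in Python for illustration. The formula 1 + ln(N / df) is an assumption chosen for the example, not necessarily the math swish uses, and the names idf and score are hypothetical. The exact values depend on the formula and the number of docs.

```python
import math


def idf(num_docs: int, docs_with_word: int) -> float:
    """Assumed IDF formula: 1 + ln(N / df). Very common words land near 1."""
    return 1.0 + math.log(num_docs / docs_with_word)


def score(word_counts: dict, query_words: list, num_docs: int, doc_freq: dict) -> float:
    """Sum of (occurrences in doc * IDF) over the query words."""
    return sum(
        word_counts.get(w, 0) * idf(num_docs, doc_freq[w]) for w in query_words
    )


# 1000 docs: 'the' appears in 98% of them, 'foo' in 10%.
NUM_DOCS = 1000
DOC_FREQ = {"the": 980, "foo": 100}

doc_a = {"the": 10, "foo": 1}   # many 'the', one 'foo'
doc_b = {"the": 2, "foo": 5}    # few 'the', several 'foo'

score_a = score(doc_a, ["the", "foo"], NUM_DOCS, DOC_FREQ)
score_b = score(doc_b, ["the", "foo"], NUM_DOCS, DOC_FREQ)
```

With this formula 'the' gets an IDF of roughly 1.02 and 'foo' roughly 3.30, so doc_b outranks doc_a even though doc_a contains more query words in total. Both docs still match the phrase words; only the weight changes.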
>
>
>
>>I have several other new ranking features in the works, but wanted to
>>get some feedback for this one before I move ahead too much in this
>>direction. Other features might include:
>>
>> normalizing weight for word density/document length
>> scaling the IDF to allow for greater granularity in difference
>> weighting words based on their proximity to other query words
>
>
> That last one would be nice -- if that worked well then the default
> search might be "OR", but the "ANDed" results get ranked much higher.
>
Yes. Though I think that a true vector ranking scheme would have to be
used instead of the current AND/OR system. I don't mean you couldn't
have both (vector ranking and boolean AND/OR) but as I understand it,
vector ranking uses document *similarity* to a query. It's much fuzzier
than the strict AND/OR boolean system swish currently uses. I guess it's
more like OR, with the effect you describe: ANDed results rank higher.
Vector ranking claims to mimic natural language query better than
boolean does.

More plainly, a vector-rank search would look like:

  1. find all docs that contain any of the query words (i.e., an OR search)
  2. within that subset, calculate a vector for each doc
  3. calculate a vector for the query words
  4. compare the query vector with each doc vector and return only those
     docs similar enough to merit inclusion (i.e., those that clear a
     threshold)

So I imagine config settings like:

  UseVectorRanking 0|1
  VectorThreshold *integer*

where a threshold of 0 returns everything that matches the OR search.
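
The steps of a vector-rank search can be sketched roughly as follows, in Python for illustration. Everything here is an assumption for the sake of the example: raw word counts as vectors, cosine similarity as the comparison, a float threshold between 0 and 1 rather than an integer, and the names cosine and vector_search. None of it exists in swish.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def vector_search(query: str, docs: dict, threshold: float = 0.0) -> list:
    """Return (doc name, similarity) pairs, most similar first."""
    q_vec = Counter(query.lower().split())
    results = []
    for name, text in docs.items():
        d_vec = Counter(text.lower().split())
        # OR search: skip docs that contain none of the query words.
        if not any(w in d_vec for w in q_vec):
            continue
        sim = cosine(q_vec, d_vec)
        # Keep only docs similar enough to clear the threshold.
        if sim >= threshold:
            results.append((name, sim))
    return sorted(results, key=lambda r: r[1], reverse=True)


docs = {
    "a": "the foo bar",      # contains both query words
    "b": "foo foo baz",      # contains only 'foo'
    "c": "nothing here",     # contains neither
}
all_hits = vector_search("the foo", docs)            # threshold 0: every OR match
strict_hits = vector_search("the foo", docs, 0.7)    # higher threshold: closest only
```

With threshold 0, docs "a" and "b" both come back and "a" (which has both words) ranks first, which is the "OR, but ANDed results rank higher" effect. Raising the threshold to 0.7 leaves only "a".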

Vector ranking would be most useful for purely HTML text searches. Anyone
using swish to index XML and/or database output would likely want:

  UseVectorRanking 0

and instead rely on the strict boolean AND/OR swish currently uses.
UseVectorRanking should probably default to 0.

I need to get some of the other stuff working better before I tackle the
whole vector deal, though. That's months off.
--
Peter Karman - Software Publications Engineer - Cray Inc
phone: 651-605-9009 - mailto:karman@cray.com
Received on Wed Aug 4 12:15:16 2004