On Wed, Aug 04, 2004 at 02:13:05PM -0500, Peter Karman wrote:
> > IgnoreTotalWordCountWhenRanking
> >
>
> Not for this IDF feature. Ideally, total word count would be used to
> calculate word density and to normalize a document for length (really,
> number of words). So IgnoreTotalWordCountWhenRanking would need to be
> set to 0 for that to work (or word count stored in the index no matter
> what but ignored for ranking -- it seems not to be stored in the index
> at all if IgnoreTotalWordCountWhenRanking is set to 1). I've been
> fooling with that but haven't had time to really test the results.
Right. I have not looked at it in quite a while, but when word counts
are enabled (not ignored), an extra table is created and must be read
while searching.
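For what it's worth, here is a rough sketch of the length normalization being discussed, in Python rather than swish-e's C. The function name and the plain frequency / word-count formula are assumptions for illustration, not what swish-e actually computes:

```python
def word_density(term_freq, total_words):
    """Fraction of a document's words that are the given term.

    Illustrative only: this is not swish-e's ranking formula.
    """
    if total_words <= 0:
        return 0.0
    return term_freq / total_words

# The same 5 hits count for more in a short document than in a long one.
short_doc = word_density(5, 100)
long_doc = word_density(5, 10000)
```

Without the stored total word count there is nothing to divide by, which is why the setting matters for this kind of ranking.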
> >>IDF has a similar effect to IgnoreLimit or StopWords, but on a smoother
> >>scale. A word isn't just in or out (a StopWord or not), but rather has a
> >>relative weight compared to all the other words in the index.
> >
> >
> >But, that's not implemented, right? So is the idea that stopwords
> >just have a much lower score?
>
> Sorry, I don't follow. What's not implemented? The 'smooth' effect
> should be felt with the new IDF feature.
I mean the stopword part. Currently stopwords are still just
removed/ignored while indexing and searching. What you are talking
about is something for the future, where stopwords are not removed
but just weighted much lower.
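To make the "smoother scale" concrete, here is a minimal sketch using the textbook IDF formula, log(N / df). That formula is an assumption for illustration; the one used by the new IDF feature is not given in this thread, and the document counts below are made up:

```python
import math

def idf(num_docs, doc_freq):
    """Textbook inverse document frequency: log(N / df)."""
    if doc_freq <= 0:
        return 0.0
    return math.log(num_docs / doc_freq)

# Hypothetical index of 1000 documents.
weights = {
    "the": idf(1000, 990),   # in nearly every doc: weight close to 0
    "swish": idf(1000, 40),  # fairly common
    "foo": idf(1000, 3),     # rare: highest weight
}
```

A word like "the" is never dropped here; it just contributes almost nothing to the score.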
> But yes, you're right about the idea: stopwords are still counted but
> just have a much lower score. That allows you to still find exact
> phrases like "the foo" as opposed to "a foo" but rank/weight is adjusted
> per word.
I've thought about having a system where stopwords are not removed on
indexing, but ignored while searching unless the stopword is part of a
phrase.
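Roughly what I have in mind, as a Python sketch. The stopword list, the function name, and the (text, is_phrase) query representation are all hypothetical, and nothing like this exists in swish-e today:

```python
STOPWORDS = {"the", "a", "of"}  # hypothetical stopword list

def search_terms(query_parts):
    """Pick the terms to look up for a parsed query.

    query_parts is a list of (text, is_phrase) pairs. Stopwords are
    dropped from loose words but kept inside phrases.
    """
    terms = []
    for text, is_phrase in query_parts:
        words = text.lower().split()
        if is_phrase:
            terms.append(tuple(words))  # keep every word, in order
        else:
            terms.extend(w for w in words if w not in STOPWORDS)
    return terms

# A quoted phrase "the foo" plus the loose words: the bar
result = search_terms([("the foo", True), ("the bar", False)])
```

Here the phrase keeps its "the", while the loose "the" is ignored and only "bar" is searched.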
> More plainly, a vector-rank search would look like:
>
> find all docs that contain any of the query words (i.e., an OR search)
> within that subset, calculate a vector for each doc
> calculate a vector for the query words
> compare the query vector with each doc vector and return only those docs
> similar enough to merit inclusion (< threshold).
How's that vector computed?
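For reference, one common textbook answer is TF-IDF weights compared by cosine similarity. Below is a minimal Python sketch of the steps quoted above under that assumption. It is not swish-e code, and every name in it (tfidf_vector, cosine, vector_rank, the sample docs) is hypothetical. Note that cosine similarity grows with similarity, so the cutoff here is >= threshold; a distance measure would use < threshold instead.

```python
import math
from collections import Counter

def tfidf_vector(words, idf):
    """Map each known word to term frequency * IDF weight."""
    counts = Counter(words)
    return {w: tf * idf[w] for w, tf in counts.items() if w in idf}

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(w, 0.0) for w, weight in a.items())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def vector_rank(docs, query, threshold):
    """Rank docs (name -> list of words) against a list of query words."""
    n = len(docs)
    doc_freq = Counter(w for words in docs.values() for w in set(words))
    # The +1.0 is a smoothing assumption so that words found in every
    # document still get a small non-zero weight.
    idf = {w: math.log(n / df) + 1.0 for w, df in doc_freq.items()}
    q_vec = tfidf_vector(query, idf)
    hits = []
    for name, words in docs.items():
        if not set(query) & set(words):  # OR search: need at least one query word
            continue
        score = cosine(q_vec, tfidf_vector(words, idf))
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)

docs = {
    "a.html": "swish indexes the foo".split(),
    "b.html": "the bar and the baz".split(),
    "c.html": "foo foo foo".split(),
}
ranked = vector_rank(docs, ["foo"], threshold=0.1)
```

With these made-up documents, c.html ranks first because it contains nothing but the query word, a.html follows, and b.html never enters the candidate set.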
> So I imagine config settings like:
>
> UseVectorRanking 0|1
> VectorThreshold *integer*
Or maybe a switch/option used at search time?
Thanks,
--
Bill Moseley
moseley@hank.org
Received on Wed Aug 4 12:23:04 2004