Re: New version swish-e-1.3.2-PHRASEi

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon May 08 2000 - 23:27:34 GMT
At 10:45 AM 05/08/00 -0700, Ron Samuel Klatchko wrote:
>
>Are you sure about this?  That means if a user searches for the phrase
>"in document" you'll turn up this entry even though the actual phrase is
>"in a document."

Or "in [any stop word(s)] document"

>Is it possible to detect all stop words at search time?  You could then
>code up search for the phrase "in a document" to find the word "in" at
>position X, no word at position X+1, and the word "document" at X+2.  I
>admit it still wouldn't be perfect since you could not differentiate
>between "in a document" and "in the document" but it seems to match
>expectations (at least my expectations) better.

So that would fail to find "in document" as you point out, but would find
"in [any stop word] document", correct?  I'm not sure how much better that
is, as it only prevents that one case of "in a document" finding "in
document", but not the case of "in a document" finding "in the document".

The problem is: how do you look up "no word" in the index?

You can look up the word "in" and find that it's in document test.html at
position 5, and then find "document" also in test.html at position 7.  But
then you would have to somehow look up what was at position 6 in test.html
to see if it was a stop word (and thus position 6 is "no word").  The index
is keyed by word, not by file and position, so it doesn't hold that
information.
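
To make the lookup problem concrete, here is a minimal sketch of a
word-keyed positional index. It is not swish-e's actual index format; the
stop list, the sample documents, and the names build_index and word_at are
all made up for illustration. Finding a word's positions is a direct
lookup, but asking what sits at a given position means scanning every
entry.

```python
from collections import defaultdict

STOP_WORDS = {"a", "the"}  # assumed stop list, for illustration only


def build_index(docs):
    """Map word -> {file name -> [positions]}, leaving stop words out.

    Positions still count stop words, so a skipped word leaves a gap.
    """
    index = defaultdict(lambda: defaultdict(list))
    for name, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            if word not in STOP_WORDS:
                index[word][name].append(pos)
    return index


def word_at(index, name, pos):
    """Return the indexed word at (file, position), or None.

    The index has no entry keyed by position, so this has to scan
    the posting list of every word: the slow path.
    """
    for word, postings in index.items():
        if pos in postings.get(name, []):
            return word
    return None


# Hypothetical sample documents.
docs = {
    "test.html": "one two three four in a document",
    "other.html": "one two three four in my document",
}
index = build_index(docs)

in_positions = index["in"]["test.html"]         # direct lookup by word
doc_positions = index["document"]["test.html"]  # direct lookup by word
gap_in_test = word_at(index, "test.html", 6)    # full scan, nothing indexed
gap_in_other = word_at(index, "other.html", 6)  # full scan, a real word
```

In this sketch the gap at position 6 in test.html can only be confirmed
by checking every word in the index, which is the cost the paragraph above
is pointing at.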

I suppose swish could handle stop words specially, and keep an additional
index by file name that lists stop words and their positions.  Then you
could look up a file by name to see if a stop word was in a given
position (and even whether it matches the exact stop word in the query).
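
A rough sketch of what such an additional index could look like, keyed by
file name and then by position. Nothing like this exists in swish-e as far
as this thread says; the stop list and the names build_stop_index and
stop_word_at are hypothetical.

```python
STOP_WORDS = {"a", "the"}  # assumed stop list, for illustration only


def build_stop_index(docs):
    """Map file name -> {position -> stop word found at that position}."""
    stop_index = {}
    for name, text in docs.items():
        words = text.lower().split()
        stop_index[name] = {
            pos: word
            for pos, word in enumerate(words, start=1)
            if word in STOP_WORDS
        }
    return stop_index


def stop_word_at(stop_index, name, pos, expected=None):
    """True if a stop word sits at (file, position).

    If `expected` is given, the stop word must also match it exactly.
    """
    found = stop_index.get(name, {}).get(pos)
    if found is None:
        return False
    return expected is None or found == expected


# Hypothetical sample document: "in" at 5, "a" at 6, "document" at 7.
docs = {"test.html": "one two three four in a document"}
stop_index = build_stop_index(docs)

any_stop = stop_word_at(stop_index, "test.html", 6)           # any stop word
exact_a = stop_word_at(stop_index, "test.html", 6, "a")       # "in a document"
exact_the = stop_word_at(stop_index, "test.html", 6, "the")   # "in the document"
```

The price is a second structure that grows with every stop word in every
file, which works against the reason for having stop words at all.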

I think it's easier to just say stop words don't exist, and then keep
their use to a minimum.




Bill Moseley
mailto:moseley@hank.org
Received on Mon May 8 19:29:29 2000