Re: Proximity Searching, Stemming

From: Bill Moseley <moseley(at)>
Date: Fri Jul 09 2004 - 18:18:22 GMT
On Fri, Jul 09, 2004 at 08:43:45AM -0700, Tac wrote:
> Does swish-e support proximity searching, so that you can find words when
> they're within a few words of each other?

No.  Jose added phrase searching some years (years?!) back, which
checks word positions: each word in the phrase must sit at the previous
word's position + 1.  So I suspect a "NEAR" operator would not be too
hard to add.  It's been discussed quite a bit in the past.
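
As a rough illustration only (this is not swish-e's code, and the
function names and the position lists are made up for the example), the
adjacency check and a NEAR-style variant could be sketched in Python
like this, assuming each word maps to a list of positions within one
document:

```python
def phrase_match(positions_a, positions_b):
    """True if some occurrence of word B sits directly after word A."""
    b = set(positions_b)
    return any(p + 1 in b for p in positions_a)


def near_match(positions_a, positions_b, distance):
    """True if some occurrences of A and B are within `distance` positions."""
    return any(abs(pa - pb) <= distance
               for pa in positions_a
               for pb in positions_b)


# Hypothetical word positions for the text "airport smoking ban"
index = {"airport": [1], "smoking": [2], "ban": [3]}

is_phrase = phrase_match(index["smoking"], index["ban"])
is_near = near_match(index["airport"], index["ban"], 5)
```

The only difference between the two is the comparison: a phrase needs
a gap of exactly one in a fixed order, while NEAR accepts any gap up to
the given distance in either order.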

> e.g.  "smoking ban" w/5 airport
> would find "airport smoking ban" and "smoking ban in airports".  If so, that
> would mean that the word offsets were somehow stored, so the next question
> would be: "could we get those word offsets?"  I realize that stemming
> happens at indexing, not searching, time, so when a document comes back, we
> really don't know what word(s) matched.  This makes highlighting difficult.
> My idea is that if we had access to the word offsets, we'd know which words
> were matched.

Too bad this stuff wasn't designed from the ground up to do
highlighting.  The position numbers that swish-e uses are probably not
of much use for finding words in the original document.  Swish would
have to maintain a new and separate database of all the words in a
document so you could ask swish what word was at file N, Meta M,
Position P.  And even then you couldn't really reconstruct the text, as
you'd also need the stopwords and non-word characters.
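
To make that concrete, here is a small Python sketch of such a lookup.
Everything in it is an assumption for illustration: the stopword list,
the tokenizing rule, and the choice not to number stopwords are
invented here and are not taken from swish-e.

```python
import re

STOPWORDS = {"in", "the", "is"}  # assumed stopword list, illustration only


def build_position_map(file_id, meta_id, text):
    """Map (file, meta, position) -> original word, skipping stopwords."""
    table = {}
    position = 0
    for word in re.findall(r"[A-Za-z]+", text):
        if word.lower() in STOPWORDS:
            continue  # assumption: stopwords are dropped and not numbered
        position += 1
        table[(file_id, meta_id, position)] = word
    return table


table = build_position_map(1, 1, "Smoking ban in airports.")

# Rebuilding from the table loses the stopword and the punctuation
rebuilt = " ".join(table[key] for key in sorted(table))
```

Asking for file 1, meta 1, position 3 gives back "airports", but the
rebuilt string has no "in" and no trailing period, so it can't be
matched directly against the original text.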

The other thing is that if you search for the word, say, "it", then
it's one thing for swish to tell you that it can be found in X
documents, but if you also want to know the word positions then you
may be getting a lot of data back.  Throw in phrases and highlighting
gets more complex.

I suspect the lack of features that swish provides is somewhat
responsible for the speed up you are seeing. ;)

Bill Moseley

Received on Fri Jul 9 11:18:45 2004