RE: Re: Operators

From: Scott Schultz <scott(at)not-real.ceweekly.com>
Date: Thu Jun 10 1999 - 18:51:55 GMT
Roy Tennant sez:

>There are no proximity operators or phrase searching capabilities in
>SWISH-E presently.

Are there any plans to include such functionality in the 
near term? 

I've taken a look at the code and I think I have a basic understanding
of how the indices are created. (Maybe someone in the core team will
get inspired to someday document what all those lists are doing. *heh*)

It looks like the index keeps a file ID for each file. For each word, 
there's a list. The list elements consist of two pieces of data: A file 
ID and the count of the number of times the word appears in that file.
There's more to it, of course, but that's the basic index.
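
To make that concrete, here is a rough sketch of the structure as I
understand it. It's in Python rather than C, purely for illustration.
The names (build_index, file_ids, index) are made up for this example
and are not taken from the SWISH-E source, and the real on-disk format
is certainly more involved.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Build a toy index from a dict mapping file name -> text.

    Returns (file_ids, index), where file_ids maps a numeric file ID to
    its file name and index maps each word to a list of
    (file_id, count) pairs. All names here are hypothetical.
    """
    file_ids = {}
    index = defaultdict(list)
    # Sort by file name so that file IDs are assigned deterministically.
    for file_id, (name, text) in enumerate(sorted(docs.items())):
        file_ids[file_id] = name
        # Naive tokenizing: lowercase, split on whitespace.
        counts = Counter(text.lower().split())
        for word, count in counts.items():
            index[word].append((file_id, count))
    return file_ids, index
```

Looking up a word then gives you every file it appears in, along with
a count that can feed into ranking.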

It looks like implementing a "near" operator would involve creating
a new list associated with each word, which would list the locations
of the word in each file. This list would have multiple entries per
file since the word would likely occur many times in each file. 
Swish-e would most likely take some parameter to specify how near
(1-3 words, maybe) so the search function would use the location
information to decide whether to register a hit and how to weight it.
The location information could also be returned to the application if
desired so that the final output could show the keywords highlighted
in some way.
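
Here is a minimal sketch of the idea, again in Python for illustration
only. Everything in it is an assumption on my part: the function names
(build_positions, near), the max_dist parameter, and the choice to
measure distance in whole words are hypothetical, not anything that
exists in SWISH-E today. Weighting is left out entirely.

```python
from collections import defaultdict

def build_positions(docs):
    """Map word -> {file_id: [word positions]} for a list of texts.

    The file ID is simply the position of the text in the input list.
    """
    positions = defaultdict(lambda: defaultdict(list))
    for file_id, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            positions[word][file_id].append(pos)
    return positions

def near(positions, w1, w2, max_dist=3):
    """Return {file_id: [(pos1, pos2), ...]} for every pair of
    occurrences of w1 and w2 that lie within max_dist words of each
    other. Files with no such pair are omitted.
    """
    hits = {}
    p1 = positions.get(w1, {})
    p2 = positions.get(w2, {})
    # Only files that contain both words can produce a hit.
    for file_id in p1.keys() & p2.keys():
        pairs = [(a, b)
                 for a in p1[file_id]
                 for b in p2[file_id]
                 if 0 < abs(a - b) <= max_dist]
        if pairs:
            hits[file_id] = pairs
    return hits
```

The position pairs that come back are what an application could use
to highlight the matched keywords in its output.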

This is off the top of my head so there's probably a better way to
do it. The immediate implication is that the index file size would
balloon dramatically. The percentage of increase would depend a lot on
the type of documents you index. In my case, my documents are rather
small without a great deal of repetition, so I've been psyching myself
up to dive in and give it a try. I think that the index could be
optimized in such a way that non-near searches could ignore the
location info and run at pretty much their current level of
performance.

However, if the core team is already working on this, I'd be more
than happy to defer to someone who knows what all the existing code
is doing. :^)

In any case, I'd be interested in some feedback about implementing
this functionality. I doubt that I'm the first person to think
about it.

Regards,

Scott Schultz
scott@ceweekly.com
Received on Thu Jun 10 11:52:56 1999