You've got the basic idea of how the index is constructed and your description
of what needs to be done to implement "nearness" sounds about right. The big
issue is as you describe it: how much will the index increase in size with
the extra data?
As far as I know, no one has tried or is currently trying to add this kind
of functionality. If you start in on it, my only suggestion would be to use #ifdefs to
wrap your changes so the program could be compiled with and without this support. For
one thing, it makes it easier to see where you made changes, and it also lets people
who are not willing to accept the larger index sizes compile it the way
they want it.
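
To make the #ifdef suggestion concrete, here is a rough sketch of how the
extra location data and a "near" test might be fenced off. None of these
names come from the swish-e source: PROXIMITY_SUPPORT, struct posting and
postings_near() are all made up for illustration, and the struct only
mirrors the file ID plus count pair described in the quoted message below.

```c
#include <stdlib.h>

/* Hypothetical build switch (not a real swish-e option). Remove this line,
 * or move it to a -DPROXIMITY_SUPPORT compiler flag, to build without the
 * extra location data. */
#define PROXIMITY_SUPPORT

/* One element of a word's list: which file, and how many times the word
 * appears in it. This layout is an assumption, not the actual index format. */
struct posting {
    int file_id;
    int count;
#ifdef PROXIMITY_SUPPORT
    int *positions;  /* word offsets within the file, ascending, count entries */
#endif
};

#ifdef PROXIMITY_SUPPORT
/* Return 1 if some occurrence of word a lies within max_dist words of some
 * occurrence of word b in the same file, otherwise 0. Both position lists
 * are assumed sorted, so one merge-style pass over them is enough. */
int postings_near(const struct posting *a, const struct posting *b, int max_dist)
{
    int i = 0, j = 0;

    if (a->file_id != b->file_id)
        return 0;

    while (i < a->count && j < b->count) {
        int d = a->positions[i] - b->positions[j];
        if (abs(d) <= max_dist)
            return 1;
        if (d < 0)
            i++;   /* a's occurrence is earlier: advance a */
        else
            j++;   /* b's occurrence is earlier: advance b */
    }
    return 0;
}
#endif
```

With the switch left out, struct posting shrinks back to the two integers
and the near test disappears from the build entirely, which is the point of
keeping the changes behind #ifdefs.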
At 11:53 AM 6/10/99 -0700, Scott Schultz wrote:
>Roy Tennant sez:
>>There are no proximity operators or phrase searching capabilities in
>Are there any plans to include such functionality in the
>I've taken a look at the code and I think I have a basic understanding
>of how the indices are created. (Maybe someone in the core team will
>get inspired to someday document what all those lists are doing. *heh*)
>It looks like the index keeps a file ID for each file. For each word,
>there's a list. The list elements consist of two pieces of data: A file
>ID and the count of the number of times the word appears in that file.
>There's more to it, of course, but that's the basic index.
>It looks like implementing a "near" operator would involve creating
>a new list associated with each word, which would list the locations
>of the word in each file. This list would have multiple entries per
>file since the word would likely occur many times in each file.
>Swish-e would most likely take some parameter to specify how near
>(1 - 3 words, maybe) so the search function would use the location
>information to decide whether to register a hit and how to weight it.
>The location information could also be returned to the application if
>desired so that the final output could show the keywords highlighted
>in some way.
>This is off the top of my head so there's probably a better way to
>do it. The immediate implication is that the index file size would
>balloon dramatically. The percentage of increase would depend a lot on
>the type of documents you index. In my case, my documents are rather
>small without a great deal of repetition, so I've been psyching myself
>up to dive in and give it a try. I think that the index could be
>optimized in such a way that non-near searches could ignore the
>location info and run at pretty much their current level of
>However, if the core team was already working on this, I'd be more
>than happy to defer to someone who knows what all the existing code
>is doing. :^)
>In any case, I'd be interested in some feedback about implementing
>this functionality. I doubt that I'm the first person to think
Received on Thu Jun 10 12:13:53 1999