Hi Scott,
On 28 Feb 2001, at 16:22, Scott Schultz wrote:
> Okay, I admit it. Trying to understand the Swish-E
> source code makes my head swim.
>
I am really sorry. It is probably my fault, because I began working on
swish to adapt the old 1.3.2 code to my own needs and to make the
indexing process much faster. Swish 1.3.X is terribly slow at both
indexing and searching large collections of data.
Rainer Scherg, Bill Moseley and I are currently working on it. There
are many new features. The latest version is always in CVS at
www.sourceforge.net. I hope to release 2.2 soon.
> Does the location structure used to store the location
> of individual words? Is this where the "position"
> variable in the results list elements comes from?
>
Yep, this is the structure you are looking for. For each entry there is
an array of structures of this type.
typedef struct {
    int metaID;
    int filenum;
    int structure;
    int frequency;    /* number of occurrences of the word in the file */
    int position[1];  /* word positions, indexed 0 to frequency-1 */
} LOCATION;
BTW, position contains the relative positions of words (word 1, 2, 3...),
not offsets inside the file. In fact, each field/metaname has its own
counter.
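
To make the layout concrete, here is a minimal sketch of how a LOCATION
with a trailing position array could be allocated and scanned. It assumes
the usual pre-C99 "struct hack" (one int declared, the rest over-allocated);
the helpers new_location and location_has_position are hypothetical names
for illustration and are not taken from the Swish-E source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    int metaID;
    int filenum;
    int structure;
    int frequency;    /* number of occurrences of the word in the file */
    int position[1];  /* word positions, indexed 0 to frequency-1 */
} LOCATION;

/* Hypothetical helper: allocate one LOCATION with room for `frequency`
   positions. The struct already holds one int, so frequency-1 more
   are added to the allocation. */
LOCATION *new_location(int metaID, int filenum, int frequency)
{
    LOCATION *loc;
    int i;

    assert(frequency > 0);
    loc = malloc(sizeof(LOCATION) + (size_t)(frequency - 1) * sizeof(int));
    if (loc == NULL)
        return NULL;
    loc->metaID = metaID;
    loc->filenum = filenum;
    loc->structure = 0;
    loc->frequency = frequency;
    for (i = 0; i < frequency; i++)
        loc->position[i] = 0;   /* caller fills in the real positions */
    return loc;
}

/* Hypothetical helper: return 1 if word position `pos` is recorded
   in this LOCATION, 0 otherwise. */
int location_has_position(const LOCATION *loc, int pos)
{
    int i;

    for (i = 0; i < loc->frequency; i++)
        if (loc->position[i] == pos)
            return 1;
    return 0;
}
```

A caller would fill position[0] through position[frequency-1] with word
numbers and release the whole thing with a single free().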
> In other words, is it possible to add some code to
> swish-e that will return the offsets of the words that were
> successfully matched? This could be used by the wrapper
> scripts to do keyword hilighting.
>
Well, we have discussed this issue before. The feature you are
proposing is only possible for files and HTML pages; it is
useless for externally filtered documents, e.g. PDF or database output
(MySQL, Oracle...).
Another important thing to consider is that stopwords are
removed (if you have stopwords, of course), so the positions might
not exactly match the words to highlight.
Keep in mind that to find the word, your CGI script has to split
the docs in the same way the indexer did, using the rules
(WordCharacters, etc.) that you specified at index time (do not
forget stemming and/or soundex).
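
As a rough illustration of what the wrapper would have to do, the sketch
below maps a word position back to a byte offset by re-splitting the text.
It is only a sketch under several assumptions: words are maximal runs of
characters from a WordCharacters-like set, stopwords are dropped before
counting (the indexer may count differently), positions start at 1, and
stemming and soundex are ignored entirely. The names word_offset and
is_stopword are hypothetical and not part of Swish-E.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the token of length `len` starting at `tok` exactly
   matches one of the stopwords (case-sensitive), 0 otherwise. */
static int is_stopword(const char *tok, size_t len,
                       const char *const *stopwords, size_t nstop)
{
    size_t i;

    for (i = 0; i < nstop; i++)
        if (strlen(stopwords[i]) == len &&
            strncmp(stopwords[i], tok, len) == 0)
            return 1;
    return 0;
}

/* Hypothetical helper: return the byte offset of the word at 1-based
   position `target`, or -1 if the text has fewer words.
   ASSUMPTION: stopwords do not advance the position counter. */
long word_offset(const char *text, const char *wordchars,
                 const char *const *stopwords, size_t nstop, int target)
{
    size_t i = 0;
    int pos = 0;

    assert(target >= 1);
    while (text[i] != '\0') {
        /* Length of the run of word characters starting here. */
        size_t len = strspn(text + i, wordchars);

        if (len == 0) {          /* separator: skip one byte */
            i++;
            continue;
        }
        if (!is_stopword(text + i, len, stopwords, nstop)) {
            pos++;
            if (pos == target)
                return (long)i;
        }
        i += len;
    }
    return -1;
}
```

Even this toy version shows the problem: the offsets are only right if
the word characters and the stopword list passed in are exactly the ones
used at index time.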
As you see, this is not very easy.
Anyway, every new idea and help is always welcome.
cu
Jose
Received on Thu Mar 1 10:24:00 2001