Re: New version swish-e-1.3.2-PHRASEi

From: Jose Manuel Ruiz <jmruiz(at)not-real.boe.es>
Date: Mon May 08 2000 - 16:19:52 GMT
Hi, Bill

Bill Moseley wrote:
> 
> At 05:10 AM 05/05/00 -0700, Jose Manuel Ruiz wrote:
> >At last there is a new version: swish-e-1.3.2-PHRASEi.
> 
> >- Wildcard search totally rewritten. expandstar and getmatchword
> >removed.
> >All the logic now is implemented in getfileinfo. So, the search is
> >faster because is computed only once.
> 
> I'd encourage people to try this version.  Indexing seems almost twice as
> fast now, and searching is really quick, too.
> 

Well, our server log shows only a few requests for the code...

Anyway, let's get to the important issues...
Indexing is faster, but one thing is not yet implemented:
recalculating word positions when automatic stopwords
(IgnoreLimit option in the config file) are found.

This is how swish-e works:
All the words are extracted from the files and, when that is finished,
automatic stopwords are removed in the removestops function (index.c).
Well, this is OK for old swish-e (1.3.2). But if you are using the PHRASE
version it is necessary to recalculate the positions of all the
words, decreasing the counter whenever an automatic stopword
precedes a valid word.

For example, a document containing:

this is a phrase in a document

gets the following word positions:

this: 1
is: 2
a: 3 6
phrase: 4
in: 5
document: 7

After processing automatic stopwords, the word "a" is removed
and the positions remain as follows:

this: 1
is: 2
phrase: 4
in: 5
document: 7

But they should be:

this: 1
is: 2
phrase: 3
in: 4
document: 5
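
To make the example concrete, here is a minimal Python sketch of the
recalculation described above. It is not the swish-e code (that is C, in
index.c); the function name and the word -> positions dictionary are
assumptions made only for illustration. It follows the straightforward
approach: for each removed stopword position, every later position is
shifted down by one.

```python
# Hypothetical sketch; names and data layout are illustrative, not swish-e's.

def remove_stopwords_naive(positions, stopwords):
    """Drop stopwords and shift the remaining word positions down.

    positions: dict mapping word -> list of 1-based positions in the document
    stopwords: set of words to remove
    """
    # Keep only non-stopword entries (copy the lists so the input is untouched).
    kept = {w: list(p) for w, p in positions.items() if w not in stopwords}
    # Every position that a removed stopword occupied.
    removed = sorted(p for w in stopwords for p in positions.get(w, []))
    # Go from the highest removed position to the lowest, so each comparison
    # is still valid against the positions that have not been shifted yet.
    for gone in reversed(removed):
        for plist in kept.values():
            for i, p in enumerate(plist):
                if p > gone:
                    plist[i] = p - 1
    return kept


doc = {"this": [1], "is": [2], "a": [3, 6],
       "phrase": [4], "in": [5], "document": [7]}
fixed = remove_stopwords_naive(doc, {"a"})
print(fixed)
```

With the document above and "a" as the only automatic stopword, "phrase",
"in" and "document" end up at positions 3, 4 and 5.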

I am now coding a fix. What I am doing is a global
recalculation of word positions if automatic stopwords are found.
It is done entirely in memory (all the words are stored in memory
while indexing), but it is a terribly CPU-hungry process
that can slow down indexing (for each stopword, all
word positions must be fixed). Since my very first update, the
code has been getting faster, but I recommend
using IgnoreWords instead of IgnoreLimit if you want
good indexing performance. This issue does not affect searching.
The fix will be available in the next update.
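
As an aside on the cost mentioned above: one possible way to avoid touching
every position once per stopword is to sort the removed positions first and
then, for each remaining position, subtract the number of removed positions
that come before it. The Python sketch below only illustrates that idea; it
is not the fix being coded for swish-e, and the function name and dictionary
layout are assumptions.

```python
# Hypothetical sketch; an alternative idea, not the swish-e implementation.
from bisect import bisect_left


def remove_stopwords_bisect(positions, stopwords):
    """Drop stopwords and renumber the remaining positions in one pass.

    positions: dict mapping word -> list of 1-based positions in the document
    stopwords: set of words to remove
    """
    # Sorted list of every position occupied by a removed stopword.
    removed = sorted(p for w in stopwords for p in positions.get(w, []))
    # bisect_left(removed, p) counts the removed positions smaller than p,
    # which is exactly how far position p has to move down.
    return {
        w: [p - bisect_left(removed, p) for p in plist]
        for w, plist in positions.items()
        if w not in stopwords
    }


doc = {"this": [1], "is": [2], "a": [3, 6],
       "phrase": [4], "in": [5], "document": [7]}
renumbered = remove_stopwords_bisect(doc, {"a"})
print(renumbered)
```

Each remaining position is handled once with a binary search, instead of
being revisited for every stopword occurrence.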

> >- Added more info to the index header to use it in the future:
> >Wordchars,
> >Beginchars, etc.
> 
> Great!  This will be useful when implemented.

I will implement it in the next update.

> 
> The phrase search still seems to bump the word counter when stop words are
> included in a search.  Is that an easy fix?
> 

As you have read above, stopwords are a terrible headache.
I hope the next update will fix the problems.

> Now, if I could just figure out a good way to do search term highlighting
> in phrase mode.  I'm having a hard time finding a way to do it that won't
> require a lot of post-processing.
> 

Well, perhaps in a near-future version. For now I will try to get to a
stable version of phrase search but, as you have read, only a few
requests for the code have been made.

Have a nice day 

Jose Ruiz

jmruiz@boe.es
Received on Mon May 8 12:21:48 2000