Skip to main content.
home | support | download

Back to List Archive

Re: stemming and swish-2.0-beta1

From: Jose Manuel Ruiz <jmruiz(at)not-real.boe.es>
Date: Wed Jun 28 2000 - 15:02:54 GMT
Bill,

Bill Moseley wrote:

>
> I'm trying to understand the difference with the old swish.
>
> The old swish used to take a search like "search for run*" and then lookup
> all the words in the index that started with run* and convert it int a
> search of
> "search for (run or runs or running)" and then use that for the query.  (I
> don't remember how it then handled that query, though.)
>
> Does 2.0 also first expand "search for run*" into "search for (run or runs
> or running)"?
>

No, 2.0 keeps the search as "search for run*".  No expand is made.

Let's compare both versions ...

Old swish-e works as follows

get all words  (expandstar) ---> this gives  (run or runs or running)

for each word in "run runs running"
    result =getfileinfo(word)      ---> This gives one result list per word
endfor

for all result              ----> "or" all results
   orresults(result[i],result[j])
enfor

As you can see, file index is read several times: Sequentially in expandstar to
get all the words and sequentially, for each word, once again, in getfileinfo
until it finds each word's info!!!
So, if you just have three words starting with r* ...
To read "run" info, old swish-e just reads "run"
To read "running" info, old swish-e reads: "run" and "running" (and discards
"run")
To read "runs" info, old swish-e reads: "run", "running" and "runs" (and
discards
"run" and "running").
You can see how terrible is this for performance if you search for "r*".


swish-e 2.0 just does the following:

   result=getfileinfo(run*)

So, no "orresults" is made. But in the other hand, the logic of getfileinfo is
more complex: It checks for '*'. If it is found, it gets all the words info at
once,
adding it to a unique result list!!

If there is not an '*' in the word, swish-e-2.0 uses a hash search to find the
word:

hash(word) ----> entry in index

In a normal word search (no '*'), old swish-e uses a sequential approach
after locating the start of the word using the first letter of the word. To find

running:
To get "running" info, old swish-e reads: "ra" .... "run" and "running" (and
discards
"ra" ... "run"). So in the worst scenary, if you have 400 words starting with
"r"
and the word searched is the last one, you have to read all the remaining 399
words
to get the data of just one searched word!!!

Well, it is a little bit complex but much faster!!

BTW, I will apply the fix and will make available soon the new beta.

cu
Jose
Received on Wed Jun 28 11:24:50 2000