At 09:25 PM 10/6/99 -0500, David Norris wrote:
>You could add a printf statement to see how
>the 'word' array is being transformed in the Stem() function in
I now need a bit more help -- my C skills are very weak, and I don't follow
search.c and index.c all that well. (Thanks to David, though, I can now
build swish on my PC and try a few things.)
Here's the problem, from my poor reading of search.c:
The word 'database' is in a file to be indexed. With stemming enabled,
Swish stems the word to 'databas' and places that word in the index (see my
previous post for -D output).
Now, when searching for 'data*', expandstar() in search.c grabs all words out of
the index that start with 'data'. In this case it finds only 'databas' and
uses that as the search word. Since stemming is enabled, Swish, rightly
so, stems the search words. But in this case 'databas' stems further into
'databa', which, of course, is NOT in the index.
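To make that sequence concrete, here is a toy sketch in C. It is not swish code: toy_stem(), toy_index, expand_prefix() and search_expand_then_stem() are hypothetical stand-ins, and the stemming rule (drop one trailing 'e' or 's') was picked only because it reproduces the two steps reported above. It assumes the wildcard match is stemmed again before the index lookup.

```c
#include <assert.h>
#include <string.h>

#define MAXWORD 64

/* Hypothetical stand-in for the index: holds the stemmed word only. */
static const char *toy_index[] = { "databas" };
#define INDEX_SIZE (sizeof toy_index / sizeof toy_index[0])

/* Toy stemmer, NOT swish's Stem(): drops a single trailing 'e' or 's'
 * from words longer than three letters. */
void toy_stem(char *word)
{
    size_t len;

    assert(word != NULL);
    len = strlen(word);
    if (len > 3 && (word[len - 1] == 'e' || word[len - 1] == 's'))
        word[len - 1] = '\0';
}

/* Returns 1 if word is stored in the toy index, 0 otherwise. */
int in_index(const char *word)
{
    size_t i;

    for (i = 0; i < INDEX_SIZE; i++)
        if (strcmp(toy_index[i], word) == 0)
            return 1;
    return 0;
}

/* Returns the first index word that starts with prefix, or NULL. */
const char *expand_prefix(const char *prefix)
{
    size_t i, n = strlen(prefix);

    for (i = 0; i < INDEX_SIZE; i++)
        if (strncmp(toy_index[i], prefix, n) == 0)
            return toy_index[i];
    return NULL;
}

/* Expand the wildcard first, then stem the match: the order described
 * above. Returns 1 on a hit, 0 on a miss. */
int search_expand_then_stem(const char *prefix)
{
    char word[MAXWORD];
    const char *match = expand_prefix(prefix);

    if (match == NULL)
        return 0;
    strncpy(word, match, MAXWORD - 1);
    word[MAXWORD - 1] = '\0';
    toy_stem(word);          /* "databas" becomes "databa" */
    return in_index(word);   /* "databa" was never indexed */
}
```

With this toy setup, search_expand_then_stem("data") returns 0 even though 'databas' is sitting in the index.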
It's hard to know where the error is, and what should be fixed.
Stem() could be modified to continue stemming until a word will not stem.
But, in my opinion, search.c is really where there is a problem with the
The words entered in the query should be stemmed, before the expandstar()
routine, not after. And not just because of this double-stemming problem.
For example, consider a source file with the word 'runs', which Swish stems
and places in the index as 'run'. Searching for running, runs, and r*, all
E:\swish\perl\x>swish -w runs*
# Search words: runs*
err: no results
What's happening here is expandstar(), and thus getmatchword(), is trying
to find all the words that begin with 'runs' in the index to use in the
expanded search query. But 'runs' isn't in the index; only its stem 'run'
is. So this fails.
So, modifying search.c to stem the query words before expanding them is the
best solution, and it means that Stem() is called less often if expandstar() generates a
large list of words to match against. (Why pull a bunch of stemmed words
out of the index, and then stem them once again?)
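A rough sketch of the proposed order, stem first and expand second, might look like the following. Again this is a toy and not a patch to search.c: toy_stem(), toy_index, expand_prefix() and search_stem_then_expand() are made-up names, and the stemming rule is the same illustrative one (drop one trailing 'e' or 's').

```c
#include <assert.h>
#include <string.h>

#define MAXWORD 64

/* Hypothetical stand-in for the index: stemmed words only. */
static const char *toy_index[] = { "databas", "run" };
#define INDEX_SIZE (sizeof toy_index / sizeof toy_index[0])

/* Toy stemmer, NOT swish's Stem(): drops a single trailing 'e' or 's'
 * from words longer than three letters. */
void toy_stem(char *word)
{
    size_t len;

    assert(word != NULL);
    len = strlen(word);
    if (len > 3 && (word[len - 1] == 'e' || word[len - 1] == 's'))
        word[len - 1] = '\0';
}

/* Returns the first index word that starts with prefix, or NULL. */
const char *expand_prefix(const char *prefix)
{
    size_t i, n = strlen(prefix);

    for (i = 0; i < INDEX_SIZE; i++)
        if (strncmp(toy_index[i], prefix, n) == 0)
            return toy_index[i];
    return NULL;
}

/* Stem the query word once, then expand it against the index. Whatever
 * comes out of the index is already stemmed, so it is used as-is and
 * never stemmed a second time. Returns 1 on a hit, 0 on a miss. */
int search_stem_then_expand(const char *query_prefix)
{
    char word[MAXWORD];

    assert(query_prefix != NULL);
    strncpy(word, query_prefix, MAXWORD - 1);
    word[MAXWORD - 1] = '\0';
    toy_stem(word);                       /* "runs" becomes "run" */
    return expand_prefix(word) != NULL;   /* "run" matches in the index */
}
```

In this toy, search_stem_then_expand("runs") and search_stem_then_expand("data") both return 1, and the stemmer runs once per query word no matter how many index words the wildcard matches.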
It would be nice to fix Stem(), too, not so much for its failure to stem a
word completely (which probably doesn't matter), but to keep Stem() from
stemming words into nonexistence and thus leaving them out of the index.
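The "keep stemming until the word will not stem" idea could be sketched as below. This is only an illustration under stated assumptions: toy_stem_once() is a made-up one-pass stemmer (drop one trailing 'e' or 's'), not swish's Stem(), and MIN_STEM_LEN is an assumed floor added here so that a word can never be stemmed away to nothing.

```c
#include <assert.h>
#include <string.h>

#define MIN_STEM_LEN 3   /* assumed floor; not a swish setting */

/* Toy one-pass stemmer, NOT swish's Stem(): drops a single trailing 'e'
 * or 's', but only while the word is longer than MIN_STEM_LEN letters.
 * Returns 1 if the word changed, 0 if it was left alone. */
int toy_stem_once(char *word)
{
    size_t len;

    assert(word != NULL);
    len = strlen(word);
    if (len > MIN_STEM_LEN &&
        (word[len - 1] == 'e' || word[len - 1] == 's')) {
        word[len - 1] = '\0';
        return 1;
    }
    return 0;
}

/* Repeat the one-pass stemmer until it reports no change. Each pass that
 * changes the word shortens it by one letter, so the loop must end. */
void toy_stem_fully(char *word)
{
    while (toy_stem_once(word))
        ;
}
```

With this loop, 'database' and 'databas' both end up as 'databa', so stemming an already stemmed word gives the same result as stemming the original, and the length floor keeps short words from disappearing.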
As my C skills are lacking, can anyone help with or recommend some code
Received on Thu Oct 7 15:13:43 1999