Here are some very interesting words from Bill Moseley.
Any more comments will be appreciated.
> I had a patch in swish to stem wild card words before expanding with
> expandstar. (It was the other way around in older versions.) It looks
> like you have reorganized the way that works, and wild card searches are
> not working as they were in my patched version.
> For example, with stemming enabled, searching for "run" "runs" or "running"
> should all stem to "run" and find the same results, and they do.
> It is debatable what should happen when mixing wild cards and stemming.
> For example, searching for "runn*" won't find "running" because "running"
> is stored in the index as "run" and doesn't match "runn".
You are totally right.
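
The mismatch is easy to reproduce with a toy index. This is only a sketch: `toy_stem` is a hypothetical stand-in for the real stemmer (it is not the Porter algorithm), and `build_index` and `prefix_search` are invented names, not swish-e functions.

```python
def toy_stem(word):
    """Very rough stand-in for a real stemmer (assumption, not Porter)."""
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each *stemmed* word to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(toy_stem(word), set()).add(doc_id)
    return index


def prefix_search(index, prefix):
    """Return documents for every stored word that starts with prefix."""
    hits = set()
    for stored, doc_ids in index.items():
        if stored.startswith(prefix):
            hits |= doc_ids
    return hits


docs = {1: "she was running", 2: "he runs daily", 3: "a long run"}
index = build_index(docs)

# "running" was stored as "run", so the prefix "runn" matches nothing.
print(prefix_search(index, "runn"))  # set()
print(prefix_search(index, "run"))   # {1, 2, 3}
```
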
swish-e-1.x works as follows:
1- expandstar translates run* into "runa or runb or run ..."
2- for each word:
   . stem the word
   . get the results for the word
3- "or" of all results
4- show results
Step 3 can be terrible if you just search for "r*".
The problem is performance: "r*" expands into every word that starts
with "r", and each of those words needs its own lookup, so the response
is slow.
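
The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the swish-e-1.x source: `toy_stem`, `expand_star`, `search_1x` and the `vocabulary` list the expansion draws from are all hypothetical. The lookup counter shows why a short prefix is expensive.

```python
def toy_stem(word):
    """Very rough stand-in for a real stemmer (assumption, not Porter)."""
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_star(vocabulary, pattern):
    """Step 1: expand 'run*' into every known word starting with 'run'."""
    prefix = pattern.rstrip("*")
    return [word for word in vocabulary if word.startswith(prefix)]


def search_1x(index, vocabulary, pattern):
    """Steps 2 and 3: stem each expanded word, look it up, OR the results."""
    results = set()
    lookups = 0
    for word in expand_star(vocabulary, pattern):
        stemmed = toy_stem(word)               # stem word
        results |= index.get(stemmed, set())   # get word results, then "or"
        lookups += 1
    return results, lookups


index = {"run": {1, 2, 3}, "runner": {4}, "rain": {5}, "read": {2, 6}}
vocabulary = sorted(index)

print(search_1x(index, vocabulary, "run*"))  # two separate lookups
print(search_1x(index, vocabulary, "r*"))    # one lookup per word starting with "r"
```

The number of lookups grows with the number of words the wild card expands to, which is what makes "r*" slow on a large index.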
swish-e-2.0 works as follows:
1- stem in run* -> need to check it. Probably it will not work with the
2- get results for run* in just one call to file index
3- show results
As you can see, this is a much more efficient process. Just try to search
for r* and you will experience it.
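
To illustrate the idea of "one call", here is a minimal sketch that assumes the index words are kept in sorted order, so every match for a prefix sits in one contiguous range. How swish-e-2.0 really lays out its index file is not shown here; `prefix_range` is a hypothetical helper.

```python
import bisect


def prefix_range(sorted_words, prefix):
    """Return all words starting with prefix from a sorted word list.

    One binary search finds the first candidate; the matches that follow
    are contiguous, so a single forward scan collects them.
    """
    lo = bisect.bisect_left(sorted_words, prefix)
    hi = lo
    while hi < len(sorted_words) and sorted_words[hi].startswith(prefix):
        hi += 1
    return sorted_words[lo:hi]


words = ["rain", "read", "run", "runner", "sun"]

print(prefix_range(words, "run"))  # ['run', 'runner']
print(prefix_range(words, "r"))    # ['rain', 'read', 'run', 'runner']
```
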
> I guess I would argue, though, that searching for "running" and searching
> for "running*" (or "runs" and "runs*") should return the same results. So
> that's why I had the patch to stem words before expanding with expandstar.
> So searching for "running*" would get stemmed to search for "run*" which
> would find all the "run" words in the index, as expected.
> Would it be difficult to make the new version also stem before expanding
> the wild card search?
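
As an illustration of what stemming before expansion would mean for a query term, here is a small sketch. Both `toy_stem` and `stem_wildcard_term` are hypothetical names, and the stemmer is a crude stand-in for the real one.

```python
def toy_stem(word):
    """Very rough stand-in for a real stemmer (assumption, not Porter)."""
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_wildcard_term(term):
    """Stem the part before a trailing '*', then put the '*' back."""
    if term.endswith("*"):
        return toy_stem(term[:-1]) + "*"
    return toy_stem(term)


for term in ("running", "running*", "runs", "runs*"):
    print(term, "->", stem_wildcard_term(term))
```

With this toy stemmer, "running*" and "runs*" both become "run*", so they reach the same stored words as "running" and "runs" do.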
It is very easy. I can add something like:
if (applyStemmingRules || applySoundexRules)
but this will give you bad performance once again (only when Stemming
> One other question: In searching you now split up words by WordCharacters.
> Just so I understand, do you merge the WordCharacters from each index file
> into one set of characters? That is, you don't process the search terms
> once per index file, but rather once for the entire search using merged
> WordCharacters and other settings?
Yes, I merge all the WordCharacters and other settings. I do the same
when merging index files.
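
A rough sketch of the merge, assuming each index file contributes its WordCharacters as a plain string. The function names and sample settings are hypothetical and only show the union-then-split idea, not the actual swish-e code.

```python
import re
import string


def merge_word_characters(per_index_chars):
    """Union the WordCharacters of every index into one set."""
    merged = set()
    for chars in per_index_chars:
        merged |= set(chars)
    return merged


def split_query(query, word_chars):
    """Split the query once, keeping only runs of allowed characters."""
    char_class = "[" + re.escape("".join(sorted(word_chars))) + "]+"
    return re.findall(char_class, query)


index_a = string.ascii_lowercase                        # letters only
index_b = string.ascii_lowercase + string.digits + "-"  # letters, digits, hyphen

merged = merge_word_characters([index_a, index_b])

print(split_query("swish-e 2.0 run*", set(index_a)))  # ['swish', 'e', 'run']
print(split_query("swish-e 2.0 run*", merged))        # ['swish-e', '2', '0', 'run']
```
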
Received on Tue Jun 27 11:47:29 2000