Re: [SWISH-E:436] Re: Indexing/Searching for Plurals

From: Mark Gaulin <gaulin(at)>
Date: Wed Aug 12 1998 - 00:28:46 GMT
I am playing around with a Stem() function from WAIS, and it works
well (for what it is).  I think that stemming has to be applied at
index time so that the words in the index are properly stemmed
(actually "de-stemmed", leaving only the root word).  Given that the
index contains only root words, the results of a search where the
search terms themselves were not de-stemmed would be lousy.
The words in the index and the words that are searched both have to
be de-stemmed.  If you put a checkbox like the one shown below
(asking if stemming is requested) then you should maintain two
indexes: one with stemming applied and one without. The version
of Swish-E that I am building enforces the rule that any search terms
applied to a de-stemmed index must themselves be de-stemmed.
(But an index can be created that is not de-stemmed, and the search
terms applied to it are left alone. Should de-stemming search terms,
regardless of the index's status, be allowed? Perhaps...)
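
To make the rule concrete, here is a minimal sketch in Python (not the
actual Swish-E code). The names toy_stem and Index are hypothetical, and
toy_stem is a crude plural stripper standing in for the WAIS Stem()
function, which does considerably more. The point is only that the same
normalization is applied to index words and to search terms, and that it
is chosen by a flag stored with the index.

```python
def toy_stem(word):
    """Crude stand-in for a real stemmer: lowercases and strips one plural 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


class Index:
    """Toy inverted index that remembers whether it was built with stemming."""

    def __init__(self, stemmed):
        self.stemmed = stemmed
        self.postings = {}  # term -> set of document names

    def _normalize(self, word):
        # The same rule is used at index time and at search time.
        return toy_stem(word) if self.stemmed else word.lower()

    def add(self, doc, text):
        for word in text.split():
            self.postings.setdefault(self._normalize(word), set()).add(doc)

    def search(self, term):
        return sorted(self.postings.get(self._normalize(term), set()))


# Two indexes over the same documents: one stemmed, one not.
stemmed = Index(stemmed=True)
plain = Index(stemmed=False)
for idx in (stemmed, plain):
    idx.add("a.html", "electric motors")
    idx.add("b.html", "one motor")

print(stemmed.search("motors"))  # finds both documents
print(plain.search("motors"))    # finds only a.html
```

Keeping both indexes around is what would let a "Perform stemming"
checkbox simply pick which index to search.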

Using wildcard searches seems like a partial solution, since there
are two sets of words that are of interest: the words the user wants
to find and the words that the author(s) of the corpus happened to
use. If I could be sure that all of my documents used "motor"
and not "motors" then there would be less of a problem. Since that
is not the case, I want to have more control.  (This is a weak argument,
I know. Basically, automatic de-stemming is just "easier" to use,
in my opinion.)
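
For comparison, here is a rough sketch of what a trailing-* wildcard
search does against an unstemmed index. Again this is Python for
illustration only; wildcard_search and the sample postings are made up
and are not taken from Swish-E. The user has to know to type the
wildcard, and the prefix can match more than the plural.

```python
def wildcard_search(postings, pattern):
    """Match 'prefix*' against index terms; without '*' the match is exact."""
    pattern = pattern.lower()
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        terms = [t for t in postings if t.startswith(prefix)]
    else:
        terms = [pattern] if pattern in postings else []
    docs = set()
    for term in terms:
        docs |= postings[term]
    return sorted(docs)


# Hypothetical unstemmed index: term -> set of document names.
postings = {
    "motor": {"b.html"},
    "motors": {"a.html"},
    "motorcycle": {"c.html"},
}

print(wildcard_search(postings, "motor"))   # exact match only: b.html
print(wildcard_search(postings, "motor*"))  # also pulls in "motorcycle"
```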

I have a working NT version of Swish-E that has the Stem function and also
the "document property" thing. It is a work in progress, but if anyone
wants to try it, let me know and I'll send it along.


At 04:14 PM 8/11/98 -0700, you wrote:
>On Tue, 11 Aug 1998, Paul J. Lucas wrote:
>> 	And I don't see why a one line instruction such as:
>> 		Use * after a word for wildcards, e.g.
>> 		"librar*" to match any one of "library,"
>> 		"librarian," or "libraries."
>> 	isn't understandable even by Joe Sixpack.
>	If you make stemming optional at *index* time, then the user
>	doesn't have a choice and I don't like that.  If the *search*
>	component is capable of either stemming or not, then you need
>	to add a checkbox to your HTML search form:
>		[_] Perform stemming
>	but then you have to explain what stemming is.  My point is
>	that either you explain how to use wildcards as I did above
>	-or- you have to explain what stemming is and give the user the
>	ability to turn it off.
>	Moral: there's no such thing as a free lunch.
>	- Paul
