Re: [SWISH-E:430] Re: Indexing/Searching for Plurals

From: Mark Gaulin <gaulin(at)>
Date: Tue Aug 11 1998 - 17:18:51 GMT
I was just looking at the source code for WAIS, and it has a "Stem()"
function based on the "Porter" algorithm, which I assume is for English:
	Porter, M.F., "An Algorithm For Suffix Stripping,"
	Program 14 (3), July 1980, pp. 130-137.

Since swish does not have anything like this, I'll take a look and see
how compatible the two algorithms are.  If it works then the thing to
do would be to allow for *optional* stemming during the index and
search processes.
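
To make the "optional stemming" idea concrete, here is a minimal sketch in Python. It is not swish or WAIS code, and it is not the full Porter algorithm: strip_plural() only applies plural-stripping rules in the spirit of Porter's first step, and the names strip_plural, build_index, and search are hypothetical. The point it illustrates is that the same stem on/off choice has to be applied at both index time and search time.

```python
import re
from collections import defaultdict


def strip_plural(word):
    """Strip common English plural endings (illustrative only, not full Porter)."""
    if len(word) <= 3:
        return word                  # leave short words such as "is" or "bus" alone
    if word.endswith("sses"):
        return word[:-2]             # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]             # ponies -> poni
    if word.endswith("ss"):
        return word                  # caress -> caress
    if word.endswith("s"):
        return word[:-1]             # motors -> motor, houses -> house
    return word


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs, stem=False):
    """Map each (optionally stemmed) word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            key = strip_plural(word) if stem else word
            index[key].add(doc_id)
    return index


def search(index, query, stem=False):
    """Return the ids of documents that contain every word of the query."""
    results = None
    for word in tokenize(query):
        key = strip_plural(word) if stem else word
        hits = index.get(key, set())
        results = hits if results is None else results & hits
    return set(results) if results else set()


docs = {1: "electric motor repair", 2: "motors for sale"}

exact = build_index(docs)                  # stemming off: exact words only
stemmed = build_index(docs, stem=True)     # stemming on: plurals folded together

print(search(exact, "motor"))
print(search(stemmed, "motors", stem=True))
```

With stemming off, a search for "motor" matches only the first document, which is the exact-match behavior Paul prefers. With stemming on at both ends, "motor" and "motors" match both documents. An irregular plural such as "teeth" is left untouched by these rules, so it would still need a lookup table of the kind Paul describes.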

At 09:19 AM 8/11/98 -0700, you wrote:
>At 08:48 AM 8/11/98 -0700, Paul Lucas wrote:
>>On Tue, 11 Aug 1998, Mark Gaulin wrote:
>>> Is there a good way to handle indexing & searching for plurals?
>>> I would like "motors" and "motor" to be the same. Tips?
>>	This process is called "stemming": to find the "stem" of a word
>>	and index based on that.  If the speed of performing this
>>	process by the Excite search engine is typical, it's a VERY
>>	slow process.  You also need lots of data (stemming tables) that
>>	know about the human language you are stemming:
>>		houses -> house
>>		housing -> house
>>		teeth -> tooth
>>		...
>>	Personally, I don't like search engines I use to do stemming at
>>	all.  I suppose Joe Sixpack might like them since he isn't used
>>	to thinking about things in the precise manner of programmer
>>	types and he expects computers to be "smart"; however, he often
>>	gets far more documents returned than he knows what to do with.
>>	In contrast, when programmer types enter queries, they are
>>	precise.  For example, if I'm trying to find a document that
>>	really only has the word "house" in it (and not "houses") then,
>>	when I enter "house" that's what I *really* want the search
>>	engine to look for and no more: if I wanted "house" or "houses"
>>	then that's what I would have entered.
>>	- Paul
>Hi Paul,
>I think this is the focal point of one of the largest problems in search
>engine design.
>Most folks who use search engines aren't programmers.  (They aren't Joe
>Sixpack either, but that's a different story).
>Since computers are designed to be intelligent machines, it is reasonable
>to expect them to be able to do things like stemming *if you want them to*.
>Thus, most database software that is designed for general use has the
>option of turning stemming on or off.
>SWISH is one of the easier search engines to set up, so it tends to get
>installed in lots of places where the general public is expected to use it.
>Unfortunately, the general public is not sophisticated enough (and probably
>never will be) to understand the problems that can arise with SWISH.
>Programmers who want their systems to be useful need to understand these
>foibles, and to understand how to use the intelligence of these powerful
>machines to compensate for them.
>I believe that the ability to write user-friendly software is the true mark
>of expertise in programming.
Received on Tue Aug 11 10:28:30 1998