Re: stemming problem

From: David L Norris <dave(at)not-real.webaugur.com>
Date: Sun Apr 14 2002 - 22:14:41 GMT

On Sun, 2002-04-14 at 13:44, Gaye Karagulle wrote:
> I have a problem with the stemming algorithm in swish-e.
> I think it is not very good.

It's using the Porter algorithm.  Anyone have a better algorithm?

> For example it stems
> "verification" as "verif", and "verify" as "verifi",
> although they are very similar words.

Porter is a five-step process (ignoring substeps).  Here's what each
word looks like after each step (0 is the input).

Verify:
  0. verify
  1. verifi
  2. verifi
  3. verifi
  4. verifi
  5. verifi

Verification:
  0. verification
  1. verification
  2. verificate
  3. verific
  4. verif
  5. verif
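
The two traces above can be reproduced with a small sketch.  This is
not the swish-e code: it implements only the four Porter rules that
fire for these two words (step 1c "y" -> "i", step 2 "ation" -> "ate",
step 3 "icate" -> "ic", step 4 "ic" -> ""), and it treats step 5 as a
no-op because neither word triggers it.  All function names here are
made up for illustration.

```python
VOWELS = "aeiou"

def is_consonant(word, i):
    """Porter's definition: 'y' is a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def has_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def replace_suffix(word, suffix, repl, min_m):
    """Replace suffix with repl if the remaining stem has measure > min_m."""
    if word.endswith(suffix):
        stem = word[:-len(suffix)]
        if measure(stem) > min_m:
            return stem + repl
    return word

def trace(word):
    """Return the word after steps 0..5 (only the rules named above)."""
    steps = [word]
    # Step 1c: y -> i when the stem contains a vowel.
    if word.endswith("y") and has_vowel(word[:-1]):
        word = word[:-1] + "i"
    steps.append(word)
    # Step 2: (m > 0) ation -> ate
    word = replace_suffix(word, "ation", "ate", 0)
    steps.append(word)
    # Step 3: (m > 0) icate -> ic
    word = replace_suffix(word, "icate", "ic", 0)
    steps.append(word)
    # Step 4: (m > 1) ic -> (nothing)
    word = replace_suffix(word, "ic", "", 1)
    steps.append(word)
    # Step 5: no rule applies to either example word, so nothing changes.
    steps.append(word)
    return steps

for w in ("verify", "verification"):
    for n, s in enumerate(trace(w)):
        print(f"{n}. {s}")
```

Because only a handful of rules are included, the sketch will give
wrong results for most other words; it is meant to show where the two
stems diverge, nothing more.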


> I wonder whether you are working on the stemming to make
> it better, or is there something that I can do for
> this purpose?

Implement or suggest a better algorithm.  We're certainly open to
ideas.  I do not know of anything better than Porter.

What we _could_ do is strip a trailing "i" from words in step 4.  This
seems to be consistent with other stems which end in "i".  A final "e"
is already removed, but other vowels are not.  We'd have to be careful
not to stem the word into nothingness, though.  ;-)
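
A rough sketch of that idea, written as a separate pass over the
finished stem.  This is not part of Porter's algorithm or of swish-e;
the function name and the min_length guard (3 here, chosen
arbitrarily) are assumptions for illustration only.

```python
def strip_trailing_i(stem, min_length=3):
    """Drop a final 'i' unless that would leave fewer than min_length letters.

    Hypothetical extra rule; min_length is an arbitrary guard against
    stemming short words into nothingness.
    """
    if stem.endswith("i") and len(stem) - 1 >= min_length:
        return stem[:-1]
    return stem

print(strip_trailing_i("verifi"))  # stem of "verify"
print(strip_trailing_i("verif"))   # stem of "verification", unchanged
print(strip_trailing_i("ski"))     # too short, left alone
```

With this rule both "verify" and "verification" end up as "verif".
Whether it conflates unrelated words elsewhere would have to be checked
against a real word list.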

Does anyone have a copy of the original Porter article?  There may be
some rationale behind each step.

-- 
 David Norris
  Dave's Web - http://www.webaugur.com/dave/
  Augury Net - http://augur.homeip.net/
  ICQ - 412039
Received on Sun Apr 14 22:16:16 2002