Re: fix for my stemmer_en2 issue

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Sat Nov 11 2006 - 05:26:43 GMT
brad miele scribbled on 11/10/06 3:22 PM:
> yes, it seems that the volume of files/words was a factor, since it 
> didn't/doesn't crop up with smaller sets.
> 
> this test was on the full set, so i am sort of baffled by why that change 
> would make the difference.
> 
> i guess i should keep looking for a more real solution. the stemmer_en1 
> doesn't seem to do as good of a job (at least according to our 
> salespeople), and we can't seem to make the jump to 2.4.4 with en2
> 
>>> i find that when i remove the two
>>> references at the top to:
>>>
>>>      { FUZZY_STEMMING_EN2,       "Stemming_en",      Stem_snowball,
>>> porter_create_env, porter_close_env, porter_stem },
>>>      { FUZZY_STEMMING_EN2,       "Stem",             Stem_snowball,
>>> porter_create_env, porter_close_env, porter_stem },
>> That's just a mapping table -- it maps the config names ("None",
>> "Stemming_en", etc.) to the code for that stemmer.
>>
>> The difference between 2.4.3 and 2.4.4 is that we removed the old
>> Porter stemmer so Stem and Stemming_en were changed to use the new
>> snowball stemmer code instead of the old Porter code.
>>

I took a look at the diffs from 2.4.3 through 2.4.4. Looks like there were a 
couple of changes: one where I took out the Stemming_en and Stem options, and 
another where I put them back in with a warning.

The difference when I put them back in, however, was that instead of being 
FUZZY_STEMMING_EN they were changed to FUZZY_STEMMING_EN2. FUZZY_STEMMING_EN was 
dropped from stemmer.h at the same time.

To make matters more confusing, the error message indicates that the deprecated 
options Stemming_en and Stem will use Stemming_en1 -- but they are marked with 
FUZZY_STEMMING_EN2, even though they call the same init/free functions as 
Stemming_en1.
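
To illustrate the kind of mismatch I mean, here is a minimal standalone sketch 
of a name-to-stemmer mapping table. It is not the real stemmer.c: the enum 
values, the struct layout, the stand-in stem functions, and the lookup() and 
count_mismatches() helpers are all hypothetical. It only shows how one row can 
carry one mode ID while pointing at the functions that belong to another mode.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical mode IDs; the real ones are defined in stemmer.h. */
typedef enum { MODE_EN1, MODE_EN2 } mode_id;

typedef const char *(*stem_fn)(const char *word);

/* Stand-in stemmers: they only report which implementation ran. */
static const char *stem_en1(const char *w) { (void)w; return "en1"; }
static const char *stem_en2(const char *w) { (void)w; return "en2"; }

typedef struct {
    mode_id     id;    /* what the rest of the code believes the mode is */
    const char *name;  /* name used in the config file */
    stem_fn     stem;  /* what actually runs */
} stem_entry;

static const stem_entry table[] = {
    { MODE_EN2, "Stemming_en",  stem_en1 },  /* ID says en2, code is en1 */
    { MODE_EN1, "Stemming_en1", stem_en1 },
    { MODE_EN2, "Stemming_en2", stem_en2 },
};

#define TABLE_LEN (sizeof table / sizeof table[0])

/* Find the table row for a config name, or NULL if there is none. */
static const stem_entry *lookup(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < TABLE_LEN; i++)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}

/* Count rows whose ID differs from the first row using the same function. */
static int count_mismatches(void)
{
    int n = 0;
    for (size_t i = 0; i < TABLE_LEN; i++)
        for (size_t j = 0; j < i; j++)
            if (table[j].stem == table[i].stem) {
                if (table[j].id != table[i].id)
                    n++;
                break;  /* compare against the first match only */
            }
    return n;
}
```

In this sketch, anything that branches on the ID of "Stemming_en" would treat 
it as en2 while the en1 function is the one that gets called, and 
count_mismatches() is one cheap way such a table could be sanity-checked.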

So I think there's something suspicious in stemmer.c. I'm going to commit a 
change to CVS -- Brad, would you take a look at the CVS version and see if that 
works any better?

And here's a little script to test all the stemmers. Use it like:

  perl stemtest.pl wordIwant2stem

and it will show how each stemmer handles wordIwant2stem. Note that 
SWISH::API 0.04 or later is required for a working Fuzzify() method.

------------------------------8<snip--------------------------
#!/usr/bin/perl
#
#   test the Swish-e stemmers
#
#
use strict;
use warnings;
use SWISH::API; # requires 0.04 or later for working Fuzzify()

my $usage = "usage: $0 word2stem\n";
my $html  = 'stem_test.html';
my $word  = shift @ARGV or die $usage;

unless (-s $html)
{
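     # create a throwaway document so swish-e has something to index
     open(my $fh, '>', $html) or die "can't write $html: $!";
     print {$fh} '<html>some words here that do not matter</html>';
     close($fh) or die "can't close $html: $!";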
}

my @warm_fuzzies = qw(
   Stemming_en
   Stem
   None
   Soundex
   Metaphone
   DoubleMetaphone
   Stemming_es
   Stemming_fr
   Stemming_it
   Stemming_pt
   Stemming_de
   Stemming_nl
   Stemming_en1
   Stemming_en2
   Stemming_no
   Stemming_se
   Stemming_dk
   Stemming_ru
   Stemming_fi
   );

for my $f (@warm_fuzzies)
{
     my $index = i($f);
     my $swish = SWISH::API->new($index);
     my $fuzzy = $swish->Fuzzify($index, $word);
     print "$f -> " . join(' ', $fuzzy->word_list) . "\n";
}

sub i
{
     my $f = shift;
     my $index = "$f.index";
     return $index if -s $index; # don't create more than once.
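     # write a one-line config selecting this fuzzy mode, then build the index
     open(my $cfg, '>', 'config') or die "can't write config: $!";
     print {$cfg} "FuzzyIndexingMode $f\n";
     close($cfg) or die "can't close config: $!";
     system("swish-e -i $html -c config -f $index 1>/dev/null") == 0
         or warn "swish-e exited with status $? for $f\n";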
     return $index;
}
------------------------------8<snip--------------------------



-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Fri Nov 10 21:26:48 2006