Re: fix for my stemmer_en2 issue

From: Brad Miele <bmiele(at)not-real.ipnstock.com>
Date: Sat Nov 11 2006 - 19:31:11 GMT
Peter,

The fixes that you made appear to have corrected the issue, at least for
the phrases that were reported to me. I am going to push the index to our
dev servers and let the salespeople and some clients test on it next
week, but I think we are all set.

Some thoughts on the size thing: the initial report was that Corey Rich
couldn't be searched, so my limited test used only Corey Rich files. It
may be that adding the other records is what triggered the problem. Also,
no other photographer named Corey could be searched by full (first plus
last) name, so maybe it worked until another person named Corey was
added?

I can do more testing of this, but since it is working I don't know if it
is needed. Let me know if you want me to.

Brad
---------------------
Brad Miele
VP Technology
IPNStock.com
866 476 7862 x902
bmiele@ipnstock.com

On Fri, 10 Nov 2006, Peter Karman wrote:

>
>
> brad miele scribbled on 11/10/06 3:22 PM:
>> yes, it seems that the volume of files/words was a factor, since it
>> didn't/doesn't crop up with smaller sets.
>>
>> this test was on the full set, so i am sort of baffled by why that change
>> would make the difference.
>>
>> i guess i should keep looking for a more real solution. the stemmer_en1
>> doesn't seem to do as good of a job (at least according to our
>> salespeople), and we can't seem to make the jump to 2.4.4 with en2
>>
>>>> i find that when i remove the two
>>>> references at the top to:
>>>>
>>>>      { FUZZY_STEMMING_EN2,       "Stemming_en",      Stem_snowball,
>>>> porter_create_env, porter_close_env, porter_stem },
>>>>      { FUZZY_STEMMING_EN2,       "Stem",             Stem_snowball,
>>>> porter_create_env, porter_close_env, porter_stem },
>>> That's just a mapping table -- it maps the config names ("None",
>>> "Stemming_en", etc.) to the code for that stemmer.
>>>
>>> The difference between 2.4.3 and 2.4.4 is that we removed the old
>>> Porter stemmer so Stem and Stemming_en were changed to use the new
>>> snowball stemmer code instead of the old Porter code.
>>>
>
> I took a look at the diffs from 2.4.3 through 2.4.4. Looks like there were a
> couple changes: one where I took out the Stemming_en and Stem options, and
> another when I put them back in with a warning.
>
> The difference when I put them back in however was that instead of being
> FUZZY_STEMMING_EN they were changed to FUZZY_STEMMING_EN2. FUZZY_STEMMING_EN was
> dropped from stemmer.h at the same time.
>
> To make matters more confusing, the error message indicates that the deprecated
> features Stemming_en and Stem will use Stemmer_en1 -- but they are marked with
> FUZZY_STEMMING_EN2 even though they call the same init/free functions as
> Stemmer_en1.
>
> So, there's definitely something suspicious in stemmer.c I think. I'm going to
> commit a change to CVS -- Brad, would you take a look at the CVS version and see
> if that works any better?
>
> And here's a little script to test all the stemmers. Use it like:
>
>  perl stemtest.pl wordIwant2stem
>
> and it will show how each stemmer handles wordIwant2stem. Note that the
> SWISH::API 0.04 is required for a working Fuzzify() method.
>
> ------------------------------8<snip--------------------------
> #!/usr/bin/perl
> #
> #   test the Swish-e stemmers
> #
> #
> use strict;
> use warnings;
> use SWISH::API; # requires 0.04 or later for working Fuzzify()
>
> my $usage = "usage: $0 word2stem\n";
> my $html  = 'stem_test.html';
> my $word  = shift @ARGV or die $usage;
>
> unless (-s $html)
> {
>     # three-arg open with a lexical filehandle
>     open(my $fh, '>', $html) or die "can't write $html: $!";
>     print $fh '<html>some words here that do not matter</html>';
>     close($fh);
> }
>
> my @warm_fuzzies = qw(
>   Stemming_en
>   Stem
>   None
>   Soundex
>   Metaphone
>   DoubleMetaphone
>   Stemming_es
>   Stemming_fr
>   Stemming_it
>   Stemming_pt
>   Stemming_de
>   Stemming_nl
>   Stemming_en1
>   Stemming_en2
>   Stemming_no
>   Stemming_se
>   Stemming_dk
>   Stemming_ru
>   Stemming_fi
>   );
>
> for my $f (@warm_fuzzies)
> {
>     my $index = i($f);
>     my $swish = SWISH::API->new($index);
>     my $fuzzy = $swish->Fuzzify($index, $word);
>     print "$f -> " . join(' ', $fuzzy->word_list) . "\n";
> }
>
> sub i
> {
>     my $f = shift;
>     my $index = "$f.index";
>     return $index if -s $index; # don't create more than once.
>     system("echo 'FuzzyIndexingMode $f' > config");
>     system("swish-e -i $html -c config -f $index 1>/dev/null");
>     return $index;
> }
> ------------------------------8<snip--------------------------
>
>
>
> -- 
> Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
>
>
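
To make the mapping-table concern above concrete, here is a minimal sketch
of the kind of inconsistency Peter describes: rows that share a mode ID but
point at different stem functions. This is not the actual stemmer.c. The
struct layout, the enum values, the function name count_mismatches, and the
"english_stem" placeholder for the en2 stem function are all assumptions
made for illustration; only the config names and the FUZZY_STEMMING_EN2 /
porter_stem pairing come from the thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Mode IDs. Names follow the thread; the numeric values are arbitrary. */
typedef enum {
    FUZZY_NONE,
    FUZZY_STEMMING_EN1,
    FUZZY_STEMMING_EN2
} FuzzyMode;

/* One row of a simplified mapping table (hypothetical layout). */
typedef struct {
    FuzzyMode   mode;      /* ID the row is tagged with           */
    const char *name;      /* name used in the config file        */
    const char *stem_impl; /* stand-in for the stem function slot */
} FuzzyEntry;

const FuzzyEntry fuzzy_table[] = {
    { FUZZY_NONE,         "None",         "none"         },
    { FUZZY_STEMMING_EN1, "Stemming_en1", "porter_stem"  },
    /* "english_stem" is a placeholder name, not taken from the source. */
    { FUZZY_STEMMING_EN2, "Stemming_en2", "english_stem" },
    /* Deprecated aliases as quoted in the thread: tagged EN2,
       but wired to the same porter_* functions as en1. */
    { FUZZY_STEMMING_EN2, "Stemming_en",  "porter_stem"  },
    { FUZZY_STEMMING_EN2, "Stem",         "porter_stem"  },
};
const size_t fuzzy_table_len = sizeof fuzzy_table / sizeof fuzzy_table[0];

/* Count pairs of rows that share a mode ID but name different
   stem functions, printing each offending pair. */
int count_mismatches(const FuzzyEntry *t, size_t n)
{
    int bad = 0;
    assert(t != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (t[i].mode == t[j].mode &&
                strcmp(t[i].stem_impl, t[j].stem_impl) != 0) {
                printf("mismatch: %s vs %s\n", t[i].name, t[j].name);
                bad++;
            }
        }
    }
    return bad;
}
```

Calling count_mismatches(fuzzy_table, fuzzy_table_len) on this table flags
the two deprecated aliases against Stemming_en2. Whether that pairing is
what actually caused the search failures is not established in this thread.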
Received on Sat Nov 11 11:31:16 2006