Bill Moseley scribbled on 11/11/06 9:47 AM:
> On Fri, Nov 10, 2006 at 09:24:12PM -0800, Peter Karman wrote:
>> The difference when I put them back in however was that instead of being
>> FUZZY_STEMMING_EN they were changed to FUZZY_STEMMING_EN2. FUZZY_STEMMING_EN was
>> dropped from stemmer.h at the same time.
>> To make matters more confusing, the error message indicates that the deprecated
>> features Stemming_en and Stem will use Stemmer_en1 -- but they are marked with
>> FUZZY_STEMMING_EN2 even though they call the same init/free functions as
> Oh, that's not good.
>> So, there's definitely something suspicious in stemmer.c I think. I'm going to
>> commit a change to CVS -- Brad, would you take a look at the CVS version and see
>> if that works any better?
> This will require re-indexing. That table maps the configuration
> names to an index number used to indicate the stemmer -- and that
> number is stored in the index to know what stemmer to use when
yes, I got to thinking about this some more last night after I checked in that
change to stemmer.c. I think there was also a problem in stemmer.h, since I had
removed FUZZY_STEMMING_EN altogether, which basically meant there was an
off-by-1 difference in 2.4.3 vs 2.4.4 indexes wrt the stemmer. It would only
manifest if you used stemming (which I don't) and searched a 2.4.3 index using
> Brad's original config had:
> FuzzyIndexingMode Stemming_en2
> which mapped to the "english" stemmer and stored FUZZY_STEMMING_EN2 in
> the index. Then when searching, FUZZY_STEMMING_EN2 was looked up in the
> table and matched the "porter" stemmer, as could be seen in his headers:
> # Fuzzy Mode: Stemming_en
> Which could cause problems. What I'm still confused about is why the
> size of the index would have made a difference.
that might be a red herring.
The header reports Stemming_en because of the order of the fuzzy_opts array.
Last night's stemmer.c just reordered those. get_fuzzy_mode() just picks the
first FUZZY_STEMMING_EN2 it finds. The stemmer.c I'm about to check in further
reorders fuzzy_opts to put the deprecated options last in the list, so they
don't get listed first and confuse folks.
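
To make the ordering issue concrete, here is a minimal C sketch of that kind of
table. It is not the real stemmer.c: the type and function names (mode_id,
fuzzy_opt, mode_name, mode_from_name) and the enum values are made up for
illustration, and the case-sensitive name match is an assumption. The reverse
lookup returns the first entry that carries a given mode, which is why the
deprecated aliases belong at the end.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mode ids; the real enum lives in stemmer.h. */
typedef enum { MODE_NONE = 0, MODE_STEM_EN1, MODE_STEM_EN2 } mode_id;

typedef struct {
    mode_id     mode;   /* number that would be stored in the index */
    const char *name;   /* configuration name */
} fuzzy_opt;

/* Canonical names first, deprecated aliases last. */
static const fuzzy_opt opts[] = {
    { MODE_NONE,     "None"         },
    { MODE_STEM_EN1, "Stemming_en1" },
    { MODE_STEM_EN2, "Stemming_en2" },
    { MODE_STEM_EN1, "Stemming_en"  },  /* deprecated alias */
    { MODE_STEM_EN1, "Stem"         },  /* deprecated alias */
};
static const size_t n_opts = sizeof opts / sizeof opts[0];

/* Reverse lookup: the first name found for a mode id, or NULL. */
const char *mode_name(mode_id mode) {
    for (size_t i = 0; i < n_opts; i++)
        if (opts[i].mode == mode)
            return opts[i].name;
    return NULL;
}

/* Forward lookup: config name to mode id. Returns 0 on success, -1 if unknown. */
int mode_from_name(const char *name, mode_id *out) {
    for (size_t i = 0; i < n_opts; i++) {
        if (strcmp(opts[i].name, name) == 0) {
            *out = opts[i].mode;
            return 0;
        }
    }
    return -1;
}
```

With this ordering, mode_name(MODE_STEM_EN1) yields "Stemming_en1". If the
"Stemming_en" row came first, the header would show the deprecated name
instead, even though both rows select the same stemmer.
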
> Peter, that fuzzy_mode index must match up to only one stemmer, but
> there can be multiple entries for a given fuzzy_mode to allow for
> aliases (Stem, Stemming_en, Stemming_en1, for example).
I think CVS is right now. I checked in a new stemmer.h just a little bit ago
that should provide backwards compat with 2.4.3 indexes (by putting back in the
enum value that could cause off-by-one), and the stemmer.c I just checked in
should be a little saner.
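
For anyone wondering how removing one enum value causes the off-by-one, here is
a small C sketch. The three layouts are hypothetical (the names and the order
are not copied from stemmer.h); the point is only that C numbers enum constants
by position, so dropping an entry shifts every later value down by one, while
keeping it as a placeholder leaves the stored numbers stable.

```c
/* Hypothetical "old" layout, as an older release might have had it. */
enum old_modes    { OLD_NONE, OLD_STEM_EN, OLD_STEM_EN1, OLD_STEM_EN2 };

/* Same list with OLD_STEM_EN removed: later values shift down by one. */
enum broken_modes { BRK_NONE, BRK_STEM_EN1, BRK_STEM_EN2 };

/* Deprecated entry kept as a placeholder: later values keep their numbers. */
enum fixed_modes  { FIX_NONE, FIX_STEM_EN, FIX_STEM_EN1, FIX_STEM_EN2 };

/* Number an indexer built with the old layout would store for "en2". */
int stored_id_old(void) { return OLD_STEM_EN2; }

/* 1 if a reader built with the shifted layout still sees "en2", else 0. */
int broken_reader_matches(void) { return stored_id_old() == BRK_STEM_EN2; }

/* 1 if a reader built with the placeholder layout still sees "en2", else 0. */
int fixed_reader_matches(void) { return stored_id_old() == FIX_STEM_EN2; }
```

In this sketch the old indexer stores 3 for "en2", the shifted layout expects
2, and the placeholder layout expects 3 again.
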
I see Brad reports that last night's stemmer.c change did the trick for his
particular case; I suspect it was the EN1/EN2 mixup that was at fault.
I did discover, while checking 2.4.3, 2.4.4 and CVS, that 2.4.4 did in fact
break the 2.4.3 index format in some way: the error is "Unknown header 32". I
suspect it's related to the RemovedWords/RemovedFiles features of the increm
version, in db_read.c, but I'm not sure about that. There were a lot of changes in db_read.c
shortly after 2.4.3 was released, which is a lot of water under the bridge
before 2.4.4 was released...
What that means is that indexes created with 2.4.3 can be read by 2.4.4, but not
the other way around. That's actually ok, I think, since I assume that the
working case (a newer version reading an older index) is going to be the most
common. However, it wasn't clearly
documented in the Changes anywhere, and we likely should have changed the Magic
Number to make it explicit. I am not going to change the Magic Number now, since
that would defeat the fix I put in stemmer.h to make CVS backwards compatible
with 2.4.3. But it ought to get changed for 2.4.5 (whenever that is...).
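
As a rough illustration of what bumping the Magic Number buys, here is a
generic C sketch of a magic check at the start of a file. The constants, the
big-endian 4-byte layout, and the names (INDEX_MAGIC_V1, INDEX_MAGIC_V2,
check_magic) are assumptions for the example, not swish-e's actual values or
code. A reader that expects one magic rejects the other format up front
instead of failing later on an unknown header id.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical magic values for two incompatible index formats. */
#define INDEX_MAGIC_V1 0x53574901u
#define INDEX_MAGIC_V2 0x53574902u

typedef enum { INDEX_OK, INDEX_TOO_SHORT, INDEX_WRONG_MAGIC } index_status;

/* Compare the first four bytes (read as big-endian) with the expected magic. */
index_status check_magic(const unsigned char *buf, size_t len, uint32_t expected) {
    uint32_t magic;

    if (len < 4)
        return INDEX_TOO_SHORT;

    magic = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
            ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];

    return magic == expected ? INDEX_OK : INDEX_WRONG_MAGIC;
}
```
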
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Sat Nov 11 13:51:27 2006