Fuzzy Indexing with Double Metaphone

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Aug 20 2002 - 23:51:10 GMT
Ok, so 2.1-dev now has an implementation of Lawrence Philips' Metaphone
Algorithm.  Actually, I used the code from Maurice Aubrey's
Text::DoubleMetaphone Perl module.  That should be very similar to the code
used in Aspell and PHP.

Metaphone algorithms are commonly used for looking up misspelled words in a
dictionary.

I also added a new configuration directive for selecting the fuzzy mode of
indexing (and thus searching).

   FuzzyIndexingMode

gives a nice warm feeling, no?

Available choices are

   FuzzyIndexingMode None|Stemmer|Soundex|Metaphone|DoubleMetaphone

You can still say "UseStemming yes" or "UseSoundex yes", but those
directives will probably be deprecated.

For example, indexing the Apache docs results in these word counts:

   None            = 7270 unique words
   Stemming        = 5149
   Soundex         = 2651
   Metaphone       = 2957
   DoubleMetaphone = 3189
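
To see why the counts drop, here is a minimal sketch of how a fuzzy mode
folds several surface words into one index key.  It uses a textbook
American Soundex written from scratch in Python; it is not swish's own
code, and the sample word list is made up for illustration.

```python
def soundex(word):
    """Textbook American Soundex (an illustration, not swish's implementation)."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    word = "".join(c for c in word.upper() if c.isalpha())
    if not word:
        return ""

    result = word[0]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue              # H and W do not separate same-coded letters
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit              # a vowel resets prev, so repeats count again
    return (result + "000")[:4]   # pad or truncate to letter + three digits


# Made-up sample: six distinct words collapse to two Soundex keys.
words = ["index", "indexes", "indexed", "indexing", "server", "servers"]
keys = {soundex(w) for w in words}
print(len(set(words)), "unique words ->", len(keys), "unique keys", sorted(keys))
```

The fewer keys a mode produces, the more unrelated words end up sharing a
key, which is the trade-off behind the numbers above.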

My guess is that stemming will be about as fuzzy as most people will want.

The fuzzy modes are less than perfect, of course.  Your mileage will vary,
as they say.  Metaphone does some weird things -- for example the current
code seems to strip off digits so that if you index a mixed word like
"2000s" it will just see the "s" and ignore the digits.
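
A rough sketch of that effect, assuming the encoder simply skips any
character that is not a letter before it builds the code (I have not
traced the actual code path, so treat the helper below as hypothetical):

```python
def letters_only(word):
    """Hypothetical pre-filter: drop everything that is not a letter."""
    return "".join(ch for ch in word if ch.isalpha())


# Under this assumption "2000s" reaches the encoder as just "s".
print(letters_only("2000s"))
```

If that is what happens, "2000s", "1990s" and a bare "s" would all end up
with the same key.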

In DoubleMetaphone mode the metaphone code can return two different
metaphones for a given word (only in some cases).  The idea is that some
words may be pronounced more than one way.  Maybe not a good example, but
"search" has two metaphones:

 White-space found word 'search'
    Adding:[1:swishdefault(1)]   'SRX'   Pos:6  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'SRK'   Pos:6  Stuct:0x1 ( FILE )

Swish indexes both of the metaphones at the same word position.  This
should allow other (better?) sound-alike matches.  That is, words that have
a metaphone value of "SRX" will match, as will words whose metaphone is
the alternate "SRK".
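
As a minimal sketch of "both metaphones at the same word position", here
is a toy positional index in Python.  The codes_for() helper and its
lookup table are hypothetical stand-ins for a real Double Metaphone
routine; only the "search" entry is taken from the debug output above, and
every other word just falls back to its upper-cased spelling.

```python
from collections import defaultdict

# Stand-in for a real Double Metaphone call (assumption for illustration).
KNOWN_CODES = {"search": ("SRX", "SRK")}


def codes_for(word):
    """Return one or two keys for a word; the fallback is a placeholder."""
    return KNOWN_CODES.get(word.lower(), (word.upper(),))


def build_index(words):
    """Map each key to the list of word positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(words, start=1):
        for code in codes_for(word):
            index[code].append(pos)   # both codes share the same position
    return dict(index)


index = build_index(["full", "text", "search"])
print(index["SRX"], index["SRK"])     # both point at position 3
```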

When searching, if a word has two metaphones, swish will expand the word
into an OR of both.  So, a search for -w search will actually be expanded:

  # Search Words: search
  # Parsed Words: ( SRX or SRK )
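
The expansion can be sketched in a few lines of Python.  This is only an
illustration of the idea, not the parser swish uses; parsed_words() and
its lookup table are hypothetical, with the "search" codes copied from the
output above.

```python
# Stand-in for a real Double Metaphone call (assumption for illustration).
KNOWN_CODES = {"search": ("SRX", "SRK")}


def parsed_words(word):
    """Rewrite a query word as one key, or as an OR group of two keys."""
    codes = KNOWN_CODES.get(word.lower(), (word.upper(),))
    if len(codes) == 1:
        return codes[0]
    return "( " + " or ".join(codes) + " )"


print(parsed_words("search"))
```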

But there's a problem.  If the word is in a phrase, swish will not find the
match.  That's because swish cannot expand an OR sub-query within a phrase.

   -w '"foo (bar or baz)"'

doesn't currently work.  Hopefully that will be changed soon.  Until then
DoubleMetaphone probably should not be used.
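
For what it's worth, one conceivable way around this (not something swish
does today, and not necessarily how it will be fixed) is to turn the
phrase into one plain phrase per combination of codes and OR those
together.  A sketch in Python, with a hypothetical lookup table and
placeholder codes for every word except "search":

```python
from itertools import product

# Stand-in for a real Double Metaphone call (assumption for illustration).
KNOWN_CODES = {"search": ("SRX", "SRK")}


def codes_for(word):
    """Return one or two keys for a word; the fallback is a placeholder."""
    return KNOWN_CODES.get(word.lower(), (word.upper(),))


def phrase_variants(phrase):
    """List every plain phrase obtained by picking one code per word."""
    per_word = [codes_for(w) for w in phrase.split()]
    return [" ".join(combo) for combo in product(*per_word)]


# Each variant is an ordinary phrase with no OR inside it.
print(phrase_variants("search engine"))
```

The number of variants doubles for every two-code word in the phrase, so
this would only be reasonable for short phrases.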

Another minor point is that you can only select one method at a time.  I
guess it is possible someone would want to apply stemming and then apply
metaphone, but that's not the way I coded it.  Speak up if you think that
is wrong.  I left this until the end of this message to reduce the
likelihood of that happening...

-- 
Bill Moseley
mailto:moseley@hank.org
Received on Tue Aug 20 23:54:39 2002