Ok, so 2.1-dev now has an implementation of Lawrence Philips' Metaphone
Algorithm. Actually, I used the code from Maurice Aubrey's
Text::DoubleMetaphone Perl module. That should be very similar to the code
used in Aspell and PHP.
Metaphone algorithms are commonly used for looking up misspelled words in a
I also added a new configuration directive for selecting the fuzzy mode of
indexing (and thus searching).
gives a nice warm feeling, no?
Available choices are
You can still say "UseStemming yes" or "UseSoundex yes", but probably will
For example, indexing the Apache docs results in these word counts:
None = 7270 unique words
Stemming = 5149
Soundex = 2651
Metaphone = 2957
DoubleMetaphone = 3189
My guess is that stemming will be about as fuzzy as most people will want.
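To put those counts side by side, here is a small Python sketch that expresses each mode's unique-word count as a percentage of the unstemmed index. The numbers are the ones quoted above; the function name relative_size is just something made up for this illustration and has nothing to do with swish itself.

```python
# Unique-word counts quoted in this message (Apache docs index).
counts = {
    "None": 7270,
    "Stemming": 5149,
    "Soundex": 2651,
    "Metaphone": 2957,
    "DoubleMetaphone": 3189,
}


def relative_size(counts, baseline="None"):
    """Return each mode's unique-word count as a percentage of the baseline."""
    base = counts[baseline]
    return {mode: round(100.0 * n / base, 1) for mode, n in counts.items()}


if __name__ == "__main__":
    for mode, pct in relative_size(counts).items():
        print(f"{mode:16s}{counts[mode]:6d}  {pct:5.1f}%")
```

Roughly, stemming keeps about 71% of the unique words, while Soundex and the metaphone modes collapse the vocabulary to somewhere between 36% and 44% of the original.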
The fuzzy modes are less than perfect, of course. Your mileage will vary,
as they say. Metaphone does some weird things -- for example the current
code seems to strip off digits so that if you index a mixed word like
"2000s" it will just see the "s" and ignore the digits.
DoubleMetaphone is where the metaphone code returns two different
metaphones for a given word (only in some cases). The idea is that some
words may be pronounced in more than one way. Maybe not a good example, but
"search" has two metaphones:
White-space found word 'search'
Adding:[1:swishdefault(1)] 'SRX' Pos:6 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'SRK' Pos:6 Stuct:0x1 ( FILE )
Swish indexes both of the metaphones at the same word position. This
should allow other (better?) sound-alike matches. That is, words that have
a metaphone value of "SRX" will match, as will other words that match the
When searching, if a word has two metaphones, swish will expand the word.
So a search for -w search will actually be expanded:
# Search Words: search
# Parsed Words: ( SRX or SRK )
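The idea of indexing both metaphones at one position and then expanding the query can be sketched in Python like this. It is only a toy model, not the swish code: the METAPHONES table stands in for a real Double Metaphone encoder (only the 'search' entry comes from the debug output above), and encode, build_index and expand are names invented for the illustration.

```python
from collections import defaultdict

# Hypothetical lookup standing in for a real Double Metaphone encoder.
# Only the 'search' entry is taken from the debug output in this message.
METAPHONES = {"search": ("SRX", "SRK")}


def encode(word):
    """Return a tuple of keys for a word (toy fallback: the upper-cased word)."""
    return METAPHONES.get(word.lower(), (word.upper(),))


def build_index(words):
    """Map each key to the set of word positions where it occurs."""
    index = defaultdict(set)
    for pos, word in enumerate(words, start=1):
        for key in encode(word):
            index[key].add(pos)  # both keys are stored at the same position
    return index


def expand(word):
    """Turn a query word into a single key or an OR group of its keys."""
    keys = encode(word)
    if len(keys) == 1:
        return keys[0]
    return "( " + " or ".join(keys) + " )"


if __name__ == "__main__":
    index = build_index(["please", "search"])
    print(dict(index))
    print(expand("search"))
```

Because both keys point at the same position, a single-word query that matches either key finds the same occurrence.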
But there's a problem. If the word is in a phrase, swish will not find the
match. That's because swish cannot expand an OR sub-query within a phrase.
That is, a query like
-w '"foo (bar or baz)"'
doesn't currently work. Hopefully that will be changed soon. Until then,
DoubleMetaphone probably should not be used.
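For what it's worth, one conceivable way around the phrase limitation is to rewrite a phrase containing alternatives into an OR of plain phrases, so "foo (bar or baz)" becomes "foo bar" or "foo baz". The Python sketch below shows only that rewriting step. It is an assumption about how it could be done, not something swish does, and expand_phrase is a made-up name.

```python
from itertools import product


def expand_phrase(alternatives):
    """Expand per-position alternatives into a list of plain phrases.

    alternatives is a list with one tuple of candidate keys per word
    position, e.g. [("foo",), ("bar", "baz")].
    """
    return [" ".join(combo) for combo in product(*alternatives)]


if __name__ == "__main__":
    phrases = expand_phrase([("foo",), ("bar", "baz")])
    # Join the plain phrases with OR to form the rewritten query.
    print(" or ".join('"%s"' % p for p in phrases))
```

The catch is that the number of phrases is the product of the number of alternatives at each position, so a long phrase with several two-metaphone words grows quickly.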
Another minor point is that you can only select one method at a time. I
guess it is possible someone would want to apply stemming and then apply
metaphone, but that's not the way I coded it. Speak up if you think that
is wrong. I left this until the end of this message to reduce the
likelihood of that happening...
Received on Tue Aug 20 23:54:39 2002