OCR/Double Metaphone phrase issue

From: Erik Corry <erik(at)>
Date: Mon Nov 18 2002 - 14:57:05 GMT



I'm evaluating Swish-E for use with data that has been scanned and
OCRred.  Looks great.  I have some ideas for how to do a fuzzy
search that catches OCR errors, but it involves generating
several indexing words for each word in the input.  This is
also something the Double-Metaphone method does now - sometimes
there are two words that are output from the Double-Metaphone
encoding.   They are placed at the same word index.

Unfortunately you can't use phrase searches if Metaphone does
this.  That's a bit of a downer.  My guess is that this is
simply because of the parsing and handling of the query.  If
we invented a new symbol for 'phrase', e.g. '#', then the (web)
frontend could transform the user's query from say:

"Fred Pollack"


to

fred # pollack

and then into

(fred | fre | frd | red) # (pollack | pollac | pollak | pollck | polack | pllack | ollack)

Is this a way to do it?  I'm thinking that all that needs changing
is the parsing of the search expression.
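
To make the idea concrete, here is a minimal Python sketch of the
frontend rewrite.  It assumes the variants are single-character
deletions (which also yields 'fed', not in my list above).  The '#'
operator is hypothetical, not existing Swish-E syntax, and the
function names are made up for illustration.

```python
import re


def deletion_variants(word):
    """Return the word plus every distinct form with one character deleted."""
    variants = [word]
    for i in range(len(word)):
        candidate = word[:i] + word[i + 1:]
        # Skip empty strings and duplicates (e.g. deleting either 'l' in 'pollack').
        if candidate and candidate not in variants:
            variants.append(candidate)
    return variants


def expand_phrase(query, phrase_op="#"):
    """Rewrite a quoted phrase as OR-groups joined by a hypothetical phrase operator."""
    words = re.findall(r"\w+", query.lower())
    groups = ["(" + " | ".join(deletion_variants(w)) + ")" for w in words]
    return f" {phrase_op} ".join(groups)


if __name__ == "__main__":
    print(expand_phrase('"Fred Pollack"'))
```

The search engine would still need to treat each parenthesised group
as a set of alternatives for one position in the phrase, which is the
part that needs parser support.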

On a related note, I am guessing that you can't do phrase searches
with blocked words?  So I can't search for "Sound of music" if
'of' is on the block list?  I suppose I can just do without a block
list.

But that raises the question:  Can the phrase search software cope
with a word occurring twice in the input?  If I have:

Fred Smith and Dorothy Smith

in the input, will a phrase search for "Dorothy Smith" work, or is
the word "Smith" only indexed once (after Fred)?
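
For illustration only, here is what I would expect from a toy
positional index that records every occurrence of a word.  This is
not a claim about how Swish-E stores positions, which is exactly
what I'm asking; build_index and phrase_match are made-up names.

```python
from collections import defaultdict


def build_index(text):
    """Map each lowercased word to the list of all positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(text.lower().split()):
        index[word].append(pos)
    return index


def phrase_match(index, phrase):
    """True if the phrase's words occur at consecutive positions in the index."""
    words = phrase.lower().split()
    if not words:
        return False
    # Try every position of the first word as a possible start of the phrase.
    return any(
        all(start + offset in index.get(w, []) for offset, w in enumerate(words))
        for start in index.get(words[0], [])
    )


if __name__ == "__main__":
    idx = build_index("Fred Smith and Dorothy Smith")
    print(idx["smith"])                        # both positions of "smith"
    print(phrase_match(idx, "Dorothy Smith"))
```

If only the first position of "smith" were kept, the lookup for
"Dorothy Smith" would fail, because no "smith" would be recorded
right after "dorothy".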

Lots of questions...

Erik Corry
