Hi,
I'm evaluating Swish-E for use with data that has been scanned and
OCRed. It looks great. I have some ideas for how to do a fuzzy
search that catches OCR errors, but it involves generating
several indexing words for each word in the input. This is
also something the Double-Metaphone method does now - sometimes
there are two words that are output from the Double-Metaphone
encoding. They are placed at the same word index.
Unfortunately you can't use phrase searches if Metaphone does
this. That's a bit of a downer. My guess is that this is
simply because of the parsing and handling of the query. If
we invented a new symbol for 'phrase', e.g. '#', then the (web)
frontend could transform the user's query from say:
    "Fred Pollack"
into
    fred # pollack
and then into
    (fred | fre | frd | red) # (pollack | pollac | pollak | pollck | polack | pllack | ollack)
Is this a way to do it? I'm thinking that all that needs changing
is the parsing of the search expression.
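A minimal sketch of the frontend transformation described above. It assumes the variants are produced by deleting one character at a time, which is only one possible model of OCR errors, and that '#' is the proposed phrase symbol; it is not existing Swish-E syntax. The names deletion_variants and expand_phrase are made up for illustration.

```python
import re


def deletion_variants(word):
    """Return the word followed by every distinct form with one character removed.

    Single-character deletion is an assumed OCR-error model, not Swish-E behaviour.
    """
    variants = [word]
    for i in range(len(word)):
        candidate = word[:i] + word[i + 1:]
        if candidate not in variants:
            variants.append(candidate)
    return variants


def expand_phrase(query, phrase_op="#"):
    """Turn a quoted phrase into OR-groups joined by the hypothetical phrase operator."""
    words = re.findall(r"\w+", query.lower())
    groups = ["(" + " | ".join(deletion_variants(w)) + ")" for w in words]
    return f" {phrase_op} ".join(groups)


print(expand_phrase('"Fred Pollack"'))
```

The search engine's query parser would still need to accept an OR-group on either side of the phrase symbol, which is the part that needs changing.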
On a related note, I am guessing that you can't do phrase searches
with blocked words? So I can't search for "Sound of music" if
'of' is on the block list? I suppose I can just do without a block
list.
But that raises the question: Can the phrase search software cope
with a word occurring twice in the input? If I have:
    Fred Smith and Dorothy Smith
in the input, will a phrase search for "Dorothy Smith" work, or is
the word "Smith" only indexed once (after Fred)?
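To make the question concrete, here is a sketch of a generic positional index, assuming every occurrence of a word is stored with its word position. This is not a description of Swish-E's actual index format; build_index and phrase_match are hypothetical names. Each slot of the phrase takes a list of acceptable terms, so several index words can stand for the same position.

```python
from collections import defaultdict


def build_index(text):
    """Map each lowercased word to the list of all positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(text.lower().split()):
        index[word].append(position)
    return index


def phrase_match(index, slots):
    """Return True if some run of consecutive positions matches the slots.

    slots is a list of lists; slot k holds the alternative terms
    accepted at offset k of the phrase.
    """
    if not slots:
        return False
    starts = {p for term in slots[0] for p in index.get(term, [])}
    for offset, slot in enumerate(slots[1:], start=1):
        here = {p for term in slot for p in index.get(term, [])}
        starts = {s for s in starts if s + offset in here}
    return bool(starts)


index = build_index("Fred Smith and Dorothy Smith")
print(index["smith"])                                # both occurrences are kept
print(phrase_match(index, [["dorothy"], ["smith"]]))
```

If the index kept only the first position of "smith", the same lookup would fail, which is exactly the case I am asking about.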
Lots of questions...
--
Erik Corry erik@arbat.com
Received on Mon Nov 18 14:57:23 2002