Re: Indexing accent characters

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Aug 12 2003 - 16:34:34 GMT
On Tue, Aug 12, 2003 at 07:02:51AM -0700, Greg Ford wrote:

> I've looked at the FAQ and done some tests - it seems that
> if those files are in XML or HTML, libxml2 will convert them to 8859-1.
> But in my tests, Latin characters with accents, e.g. AMACRON (&#257;),
> are not indexed. I was hoping they would be converted
> to the plain letter (a) - stripping the accents off would make my
> data conveniently searchable.

Sorry, no good solution here.

With ParserWarnLevel set,

  1.txt:1: warning: Failed to convert internal UTF-8 to Latin-1.
  Replacing non ISO-8859-1 char with char ' '
  andes foo &#257;
                ^
That message is from libxml2.  

One option would be to examine the text before converting and do some
text replacement before calling UTF8Toisolat1() (which is a libxml2
function) on the UTF-8 source string.

Or perhaps try to figure out what the UTF-8 character is and then use 
that instead of the space (ENCODE_ERROR_CHAR) character.
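Both ideas come down to decoding the UTF-8 sequence and looking up a plain-letter substitute. Here is a rough sketch of the pre-pass variant, run on the UTF-8 buffer before it is handed to the Latin-1 conversion. Nothing in it is swish-e or libxml2 code: fold_table, utf8_decode(), fold_lookup() and fold_accents_utf8() are made-up names, and the table only covers the macron vowels as an illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mapping table: code points outside Latin-1 and the plain
 * ASCII letter to index in their place. Only macron vowels are listed. */
struct fold_entry { uint32_t cp; char ascii; };

static const struct fold_entry fold_table[] = {
    { 0x0100, 'A' }, { 0x0101, 'a' },   /* A/a with macron */
    { 0x0112, 'E' }, { 0x0113, 'e' },   /* E/e with macron */
    { 0x012A, 'I' }, { 0x012B, 'i' },   /* I/i with macron */
    { 0x014C, 'O' }, { 0x014D, 'o' },   /* O/o with macron */
    { 0x016A, 'U' }, { 0x016B, 'u' },   /* U/u with macron */
};

/* Decode one UTF-8 sequence at s (n bytes available). Stores the code
 * point in *cp and returns the bytes consumed, or 0 if the sequence is
 * malformed or truncated. Overlong forms are not rejected. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c;

    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0)      { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else return 0;
    if (n < len) return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* Return the ASCII substitute for cp, or 0 if the table has none. */
static char fold_lookup(uint32_t cp)
{
    size_t i;
    for (i = 0; i < sizeof fold_table / sizeof fold_table[0]; i++)
        if (fold_table[i].cp == cp) return fold_table[i].ascii;
    return 0;
}

/* Copy in[0..inlen) to out, replacing mapped code points with their
 * ASCII letter. Everything else, including malformed bytes, is copied
 * unchanged. out must hold at least inlen bytes; the result is never
 * longer than the input. Returns the number of bytes written. */
size_t fold_accents_utf8(const unsigned char *in, size_t inlen,
                         unsigned char *out)
{
    size_t i = 0, o = 0;

    while (i < inlen) {
        uint32_t cp;
        size_t len = utf8_decode(in + i, inlen - i, &cp);
        char repl;

        if (len == 0) { out[o++] = in[i++]; continue; }
        repl = (len > 1) ? fold_lookup(cp) : 0;
        if (repl) {
            out[o++] = (unsigned char)repl;
        } else {
            memcpy(out + o, in + i, len);
            o += len;
        }
        i += len;
    }
    return o;
}
```

The output is still UTF-8, so characters that do fit in Latin-1 pass through untouched for the later conversion. The same utf8_decode() and fold_lookup() pair could instead be called at the point where the conversion fails, using the looked-up letter in place of the space.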

The next step is to edit parser.c and replace the code that converts
from UTF-8 to latin-1 with a call to iconv, and allow setting character
sets in the swish-e config file.
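
As a rough idea of what the iconv version might look like, here is a sketch using the standard POSIX iconv calls. It is not the existing parser.c code: convert_charset() and REPLACEMENT_CHAR are made-up names (the latter standing in for ENCODE_ERROR_CHAR), and how the "from" and "to" names would be read from the config file is left open. The skip logic after a failed character assumes the source is UTF-8.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

#define REPLACEMENT_CHAR ' '   /* hypothetical stand-in for ENCODE_ERROR_CHAR */

/* Convert inlen bytes of `in` from charset `from` to charset `to`,
 * writing at most outcap bytes to `out`. Characters the target charset
 * cannot represent are replaced with REPLACEMENT_CHAR.
 * Returns the number of bytes written, or (size_t)-1 on failure. */
size_t convert_charset(const char *from, const char *to,
                       const char *in, size_t inlen,
                       char *out, size_t outcap)
{
    iconv_t cd = iconv_open(to, from);   /* note the order: to, then from */
    char *ip = (char *)in;               /* iconv() takes a non-const pointer */
    char *op = out;
    size_t ileft = inlen, oleft = outcap;
    int failed = 0;

    if (cd == (iconv_t)-1)
        return (size_t)-1;               /* unknown charset name */

    while (ileft > 0) {
        if (iconv(cd, &ip, &ileft, &op, &oleft) != (size_t)-1)
            break;                       /* all input consumed */

        /* Only recover from an unconvertible character; give up on a
         * full output buffer (E2BIG) or truncated input (EINVAL). */
        if (errno != EILSEQ || oleft == 0) {
            failed = 1;
            break;
        }
        *op++ = REPLACEMENT_CHAR;
        oleft--;

        /* Skip the offending UTF-8 sequence: its lead byte plus any
         * continuation bytes (10xxxxxx) that follow. */
        do {
            ip++;
            ileft--;
        } while (ileft > 0 && ((unsigned char)*ip & 0xC0) == 0x80);
    }

    iconv_close(cd);
    return failed ? (size_t)-1 : outcap - oleft;
}
```

Some iconv implementations also accept a "//TRANSLIT" suffix on the target name to approximate characters they cannot represent, but what comes out depends on the implementation and the locale, so it would need testing before relying on it to strip accents.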



-- 
Bill Moseley
moseley@hank.org
Received on Tue Aug 12 16:36:09 2003