Re: Problems with ISO 8859-1 to UTF-8 Conversion?

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Sep 13 2002 - 13:27:38 GMT
On Fri, 13 Sep 2002, Thomas Seifert wrote:

> <?xml version="1.0" encoding="ISO-8859-1"?>
> <xml>
> <titel>L'instit : Le choix de Théo</titel><desc>francetélévision (France2)|1 
> Boulevard Victor, Immeuble Le Barjac|F75015|Paris|11-09-02 
> 21:10||||</desc></xml>

> In the Config File I use the "TranslateCharacters :ascii7:" Parameter which 
> should index "Théo" as "Theo" (as I understood with this feature only the 
> Index is converted, not the actual text) so that i could search for "theo" 
> and find the above document.

Can you switch to the XML2 parser?  The problem is in swish-e's own XML
parser: it does not convert the UTF-8 back to an eight-bit character set.

The plan is to add iconv to the parsers in 2.3 development, but currently
only the XML2 parser converts UTF-8 back to 8859-1.
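
To illustrate what that missing conversion means for "Théo", here is a
small Python sketch.  It is not swish-e code; the variable names are made
up for the example, and it assumes the parser hands the text back as UTF-8
bytes while the index expects ISO-8859-1.

```python
# Illustration only, not swish-e code. Assumption: the parser returns
# UTF-8 bytes and the index expects one byte per character (ISO-8859-1).
text = "Théo"

utf8_bytes = text.encode("utf-8")  # "é" becomes the two bytes C3 A9

# Without the conversion, each UTF-8 byte is treated as its own
# ISO-8859-1 character, so "é" shows up as two unrelated characters.
misread = utf8_bytes.decode("iso-8859-1")

# With the conversion (the job iconv would do): decode the UTF-8,
# then re-encode as ISO-8859-1, giving a single byte E9 for "é".
converted = utf8_bytes.decode("utf-8").encode("iso-8859-1")

print(repr(misread))
print(converted)
```

Once "é" is split into two other characters, a character translation that
maps "é" to "e" has nothing to match, which would explain why searching
for "theo" fails.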

Here's another problem -- which might be a bug.

In HTML, if you write <em><strong>H</strong>ello</em>, you would expect
"Hello" to be indexed as a single word, so a tag on its own does not
split a word.  But your XML has </titel><desc> with nothing between the
two tags, so you will end up with the single word 'theofrancetelevision',
which is probably not what you want.

If <titel> or <desc> are metanames or properties then that won't happen.
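
The word joining can be reproduced with a rough Python sketch.  This is
not how swish-e is implemented: strip_tags, fold_to_ascii and words are
hypothetical helpers, and fold_to_ascii is only a stand-in for what
"TranslateCharacters :ascii7:" is meant to do.

```python
import re
import unicodedata

def strip_tags(markup, tags_break_words):
    # Replace every tag with a space (tags split words) or with
    # nothing (tags are invisible to the word splitter).
    return re.sub(r"<[^>]+>", " " if tags_break_words else "", markup)

def fold_to_ascii(text):
    # Hypothetical stand-in for ":ascii7:": drop accents, lowercase.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed
                   if not unicodedata.combining(c)).lower()

def words(text):
    # Very simple word splitter: runs of ASCII letters and digits.
    return re.findall(r"[a-z0-9]+", text)

xml = "<titel>Le choix de Théo</titel><desc>francetélévision (France2)</desc>"

joined = words(fold_to_ascii(strip_tags(xml, tags_break_words=False)))
split = words(fold_to_ascii(strip_tags(xml, tags_break_words=True)))

print(joined)
print(split)
```

With tags_break_words=False the list contains 'theofrancetelevision';
with True it contains 'theo' and 'francetelevision' as separate words.
Putting whitespace between </titel> and <desc> in the XML has the same
effect as the second case.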

-- 
Bill Moseley moseley@hank.org
Received on Fri Sep 13 13:31:12 2002