Re: libxml2 and non-ascii?

From: Roman Chyla <chyla(at)not-real.knihovnabbb.cz>
Date: Mon Nov 22 2004 - 12:14:22 GMT
-------- Original Message --------
Subject: Re: [SWISH-E] Re: libxml2 and non-ascii?
Date: Mon, 22 Nov 2004 11:51:15 +0100
From: Roman Chyla <chyla@knihovnabbb.cz>
To: moseley@hank.org
References: <419E063F.7020104@knihovnabbb.cz> 
<20041119145132.GC17200@hank.org>

Hi,

Thank you for the link. I played with the configuration, but I am afraid
the hints from the FAQ do not solve my problem with either the
Windows-1250 or the ISO-8859-2 encoding when using the libxml2 parser.

I also tried the "TranslateCharacters" option, but since libxml2 hands
back multi-byte UTF-8 rather than 8-bit characters, I cannot map it with
an 8-bit table (did I miss something?).

Perhaps there could be a new TranslateCharactersUTF directive for users
with libxml2 and non-8859-2 characters in their docs?
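
To make the problem concrete, here is a minimal Python sketch (not
Swish-e code). It assumes the indexer narrows libxml2's UTF-8 output to
ISO-8859-1, which would explain the stripped characters but is not
confirmed in this thread. The sample phrase and the helper names
narrow_to_latin1 and fold_to_ascii are made up for illustration;
fold_to_ascii only hints at what a TranslateCharactersUTF-style
directive might do.

```python
import unicodedata

# Czech test phrase, chosen only for illustration.
SAMPLE = "Příliš žluťoučký kůň"


def narrow_to_latin1(text: str) -> str:
    """Drop every character that has no ISO-8859-1 code point.

    Assumption: this mimics the narrowing step that loses Czech letters.
    """
    return text.encode("latin-1", errors="ignore").decode("latin-1")


def fold_to_ascii(text: str) -> str:
    """Decompose accented letters and discard the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    # Both legacy encodings can represent the phrase in one byte per letter.
    for encoding in ("cp1250", "iso-8859-2"):
        raw = SAMPLE.encode(encoding)
        assert raw.decode(encoding) == SAMPLE

    # In UTF-8 the same letter needs two bytes, so a byte-to-byte table
    # cannot address it.
    print(len("č".encode("utf-8")))   # 2

    # Narrowing to Latin-1 silently loses most Czech letters.
    print(narrow_to_latin1(SAMPLE))   # Píli luouký k

    # Folding at the Unicode level keeps the base letters instead.
    print(fold_to_ascii(SAMPLE))      # Prilis zlutoucky kun
```

The point of the sketch is that the mapping has to happen while the text
is still Unicode; once it has been squeezed into 8 bits, the Czech
letters are already gone.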

best regards

roman

Bill Moseley wrote:
> On Fri, Nov 19, 2004 at 06:41:31AM -0800, Roman Chyla wrote:
> 
>>Hi,
>>
>>I have noticed that when I use libxml2 on my indexed files, special
>>characters are stripped off (in my case, Czech characters).
> 
> 
> Let us know if this doesn't answer your question:
> 
> http://swish-e.org/current/docs/SWISH-FAQ.html#How_do_I_index_non_English_words_
> 
> 
>>Switching to DefaultContents HTML solved that problem (together with
>>the TranslateCharacters directive).
> 
> 
> The HTML parser is old and broken.  But it knows nothing of encodings
> so it will just index 8-bit chars regardless of what they are.  But
> that parser does make more mistakes than the libxml2 parser and many
> features are not supported in HTML that are in the HTML2 parser.
> 
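
As a rough illustration of why a parser that ignores encodings can still
work with an 8-bit mapping, the Python sketch below builds a byte-to-byte
table and applies it to raw bytes. It is only an analogy to what
TranslateCharacters does, not Swish-e code; byte_table and the mapping
are hypothetical names and values.

```python
def byte_table(encoding: str, mapping: dict) -> bytes:
    """Build a 256-entry table for bytes.translate().

    Each key of `mapping` must encode to a single byte in `encoding`.
    """
    table = bytearray(range(256))
    for src, dst in mapping.items():
        table[src.encode(encoding)[0]] = ord(dst)
    return bytes(table)


# Hypothetical mapping, just enough for the sample word.
CZECH_TO_ASCII = {"č": "c", "š": "s"}

if __name__ == "__main__":
    raw = "čeština".encode("cp1250")

    # Table built for the encoding the file really uses: works.
    print(raw.translate(byte_table("cp1250", CZECH_TO_ASCII)))
    # b'cestina'

    # Table built for ISO-8859-2: 'č' shares byte 0xE8 in both encodings,
    # but 'š' does not (0x9A vs 0xB9), so it is left untouched.
    print(raw.translate(byte_table("iso-8859-2", CZECH_TO_ASCII)))
    # b'ce\x9atina'
```

So a byte-level mapping works only when the table matches the encoding
of the indexed files, which is worth keeping in mind when switching
between Windows-1250 and ISO-8859-2 documents.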
Received on Mon Nov 22 04:14:29 2004