-------- Original Message --------
Subject: Re: [SWISH-E] Re: libxml2 and non-ascii?
Date: Mon, 22 Nov 2004 11:51:15 +0100
From: Roman Chyla <chyla@knihovnabbb.cz>
To: moseley@hank.org
References: <419E063F.7020104@knihovnabbb.cz>
<20041119145132.GC17200@hank.org>
Hi,
thank you for the link. I played with the configuration, but I am afraid
the hints from the FAQ can't solve my problem, neither in Windows-1250 nor
in ISO-8859-2 encoding, when using the libxml2 parser.
I also tried the "TranslateCharacters" option, but since UTF is 16-bit I
cannot map it to 8-bit characters (did I miss something?).
Perhaps there could be a new TranslateCharactersUTF directive for users
with libxml2 and non-8859-2 characters in their docs?
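
To illustrate the kind of mapping I have in mind, here is a rough Python
sketch. It is not Swish-e code, and fold_to_ascii is only a name made up
for this example. It assumes the text is already decoded to Unicode, and
it folds accented Czech letters to their plain ASCII base letters by
decomposing them and dropping the combining marks.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Hypothetical helper: strip diacritics by Unicode decomposition."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    word = "příliš žluťoučký kůň"
    print(fold_to_ascii(word))

    # The same letters fit into ISO-8859-2 but not into ISO-8859-1 (Latin-1),
    # so a plain 8-bit Latin-1 mapping cannot represent them.
    try:
        "č".encode("latin-1")
    except UnicodeEncodeError:
        print("no Latin-1 code point for this letter")
```

A directive along these lines would let accented and unaccented spellings
match the same index entry, at the cost of losing the distinction between
them.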
best regards
roman
Bill Moseley wrote:
> On Fri, Nov 19, 2004 at 06:41:31AM -0800, Roman Chyla wrote:
>
>>Hi,
>>
>>I have noticed that when I use libxml2 on my indexed files, special
>>characters are stripped off (in my case Czech characters)
>
>
> Let us know if this doesn't answer your question:
>
> http://swish-e.org/current/docs/SWISH-FAQ.html#How_do_I_index_non_English_words_
>
>
>>Switching to DefaultContents HTML solved that problem - (together with
>>TranslateCharacters directive)
>
>
> The HTML parser is old and broken, but it knows nothing of encodings,
> so it will just index 8-bit chars regardless of what they are. That
> parser does make more mistakes than the libxml2 parser, though, and many
> features that the HTML2 parser supports are not supported by HTML.
>
Received on Mon Nov 22 04:14:29 2004