Re: libxml2, utf, local encodings and HTML2 parser

From: Bill Moseley <moseley(at)>
Date: Thu Mar 24 2005 - 14:37:04 GMT
On Thu, Mar 24, 2005 at 01:08:21AM -0800, Roman Chyla wrote:
> Hi,
> libxml2 converts the stuff into utf8 and then sends it to swish-e in 
> iso8859-1. I was looking at libxml2 site, and found it is possible to 
> compile it with iconv support (or more, tell libxml2 to output the 
> document in the original encoding).

That only applies when writing the tree back out to an XML file.  Swish
is using the SAX parser, so it always gets the data in UTF-8.

> Is it possible to change the way libxml2 outputs to swish-e?

No, but you could change to a different 8-bit encoding for indexing.

> this would help me to use HTML2 parser even for non-iso8859-1 documents. 
> However, what should I look at? How can I do it (if I can)?

In parser.c, look at the function Convert_to_latin1().  You would need
to replace the call to libxml2's UTF8Toisolat1() with another function,
perhaps one built on iconv, and adjust the code that follows to work
with that function.
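
As a rough sketch of what such a replacement might look like: the helper
below is hypothetical (utf8_to_charset is not part of swish-e or libxml2)
and assumes a POSIX iconv is available.  It keeps the same in/out length
convention as UTF8Toisolat1() so the calling code needs fewer changes,
and takes the target charset name as an extra argument.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for libxml2's UTF8Toisolat1(), built on iconv.
 * Converts UTF-8 input to the 8-bit encoding named by `charset`
 * (for example "ISO-8859-2").
 *
 * On return, *inlen holds the number of input bytes consumed and
 * *outlen the number of output bytes produced.
 * Returns the number of bytes written, -1 if the converter could not
 * be opened, or -2 if iconv reported any error (unconvertible
 * character, invalid or truncated input, or output buffer too small).
 */
int utf8_to_charset(const char *charset,
                    unsigned char *out, int *outlen,
                    const unsigned char *in, int *inlen)
{
    assert(charset && out && outlen && in && inlen);

    iconv_t cd = iconv_open(charset, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    /* iconv advances these pointers and decrements the counters. */
    char *inptr = (char *)in;
    char *outptr = (char *)out;
    size_t inleft = (size_t)*inlen;
    size_t outleft = (size_t)*outlen;

    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);

    /* Report how far we got, even when the conversion stopped early. */
    *inlen = (int)(inptr - (char *)in);
    *outlen = (int)(outptr - (char *)out);

    if (rc == (size_t)-1)
        return -2;
    return *outlen;
}
```

Opening and closing the converter on every call keeps the sketch short;
real code would probably open it once and reuse it.  You would also have
to decide what to do with characters that have no equivalent in the
target encoding, since iconv stops at the first one it cannot convert.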

Bill Moseley

Received on Thu Mar 24 06:37:10 2005