Nikola wrote on 11/10/07 5:14 AM:
> Thank you,
> I'll try to adjust my WordCharacters setting in order to support Windows-1251.
> But the encoding is already declared in the xml documents. They all start with
> <?xml version="1.0" encoding="Windows-1251"?>
> Maybe the name of the encoding or even the encoding is not supported by
> libxml2. Should I try alternative names for this encoding, like cp-1251?
What I suspect is going on is something like this.
Your docs are in 1251 encoding. libxml2 converts them to utf-8 internally for
parsing, and the text is then converted back to latin1 for storage (this is all
in parser.c). Because not every codepoint in 1251 maps to a codepoint in latin1
(IIRC, latin1 has no Cyrillic letters at all, and it uses fewer of the 128
codepoints in the upper half of the 256 range than 1251 does), the 1251
characters with no latin1 equivalent are silently dropped.
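
To make the suspected loss concrete, here is a small Python sketch of the same
round trip (1251 bytes -> Unicode -> latin1). It is only an illustration: the
function name is made up, and errors="ignore" is a stand-in for the silent
dropping; the actual code path in libxml2 and parser.c may behave differently.

```python
def roundtrip_1251_to_latin1(raw: bytes) -> bytes:
    """Hypothetical helper: mimic decoding 1251 input and storing it as latin1."""
    # Step 1: the parser decodes the declared encoding into Unicode text.
    text = raw.decode("cp1251")
    # Step 2: the text is stored as latin1. Characters that latin1 cannot
    # represent are dropped here (an assumption standing in for the real code).
    return text.encode("latin-1", errors="ignore")


# Bytes as they would sit in a Windows-1251 document.
raw = "Привет, world".encode("cp1251")

print(roundtrip_1251_to_latin1(raw))  # b', world'
```

The ASCII part survives, while every Cyrillic letter disappears, which would
match words going missing from the index.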
Since your docs are XML, try using the older expat parser instead of libxml2 by
specifying the XML type instead of XML2 or XML* (likewise HTML instead of HTML2
or HTML*). That way the round trip through utf-8 to latin1 should not happen.
And, as I mentioned before, adjust WordCharacters to include all the 1251
characters.
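
If it helps with the WordCharacters part, the following Python sketch lists
which bytes in the upper half of 1251 are letters. The function name is
hypothetical, and how you then paste those bytes into your config (and which
related settings you also want to extend) is up to you; this only finds the
candidates.

```python
def cp1251_high_letters() -> bytes:
    """Hypothetical helper: bytes 0x80-0xFF that cp1251 decodes to a letter."""
    letters = bytearray()
    for b in range(0x80, 0x100):
        try:
            ch = bytes([b]).decode("cp1251")
        except UnicodeDecodeError:
            # Skip any byte the codec treats as unassigned.
            continue
        if ch.isalpha():
            letters.append(b)
    return bytes(letters)


high = cp1251_high_letters()
# Print the byte values in hex so they can be checked against a 1251 chart.
print(" ".join(f"{b:02X}" for b in high))
```

Writing `high` to a file in binary mode gives the raw 1251 bytes, which is
probably what a config file saved in 1251 would need to contain.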
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Users mailing list
Received on Sun Nov 11 22:00:13 2007