On 09/12/2007 04:18 PM, Bill Moseley wrote:
> On Wed, Sep 12, 2007 at 03:51:08PM -0500, Peter Karman wrote:
>> Only problem I see is similar to one we've hit before: docs that have charset
>> declared in the <head> but which do not actually adhere to that charset. So how
>> do you know what the "original charset" really is?
> I'm thinking that should not be a problem (he says without trying)
> because if I can get the content decoded into Perl, then encoding it
> back to that charset should not be a problem as far as Perl is
> concerned. Maybe not be the same doc as going in but should be valid
> characters. The encode and decode methods have a CHECK value to say
> what to do for bad characters.
I guess then the onus is on Perl to deal with mismatched encodings. It actually
seems to do that reasonably well in some cases, horribly in others. The biggest
issue I've seen is when it interprets bytes intended as UTF-8 as Latin1. There
are cases where the same sequence of bytes is valid in both encodings, and Perl
seems to assume Latin1 as the default.
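Both behaviors are easy to reproduce outside Perl. Here's a small Python sketch (my translation for illustration, not the Perl code in question) showing the CHECK-style error handling Bill mentions, and why an "assume latin1" fallback never complains:

```python
# A Python sketch of the two behaviors discussed above; the bytes are
# chosen to mirror the aacute example later in this mail.
raw = b"\xc3\xa1\xe1"  # utf8 a-acute (c3 a1) followed by a latin1 a-acute (e1)

# Strict decoding fails on the stray latin1 byte -- the analog of a
# strict CHECK value in Encode croaking on bad input.
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass  # \xe1 is not valid utf8 in this position

# A lenient CHECK value substitutes a replacement character instead.
print(raw.decode("utf-8", errors="replace"))  # a-acute followed by U+FFFD

# And latin1 decoding never fails, since every possible byte is a valid
# latin1 character -- which is exactly why guessing latin1 always
# "succeeds", silently, and produces mojibake.
print(raw.decode("latin-1"))  # Atilde, iexcl, a-acute
```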
> What I'm not so clear on is how libxml2 will detect the correct charset.
> I think it will look at a <meta> tag, but not all docs will have that.
Yes, it does look at the <meta> tag for html, and at the <?xml> declaration for xml.
> I wonder if it would not be smart to try and parse the document and
> try and replace. How that would be done is different for xml or html
> docs, and working with plain text is something else.
> Then there's the issue of filters.
> Do you remember what libxml2 does if it doesn't find an explicit
> encoding declaration?
I believe it tries to guess the encoding based on N bytes from the beginning of
the doc. libxml2 converts everything to utf8 internally, and I believe that if
the encoding is not declared and can't be guessed (as with utf-16 for example),
then it will default to assuming utf8.
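To make the guessing step concrete, here's a hypothetical sketch in Python of byte-based detection, loosely in the spirit of sniffing the first N bytes -- this is not libxml2's actual algorithm, and the function name is mine:

```python
def sniff_encoding(head: bytes) -> str:
    """Hypothetical charset sniffer working from the first bytes of a doc.
    Loosely modeled on the idea described above; NOT libxml2's real code."""
    # A BOM is the most reliable signal.
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if head.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    # utf-16 without a BOM is still detectable: ASCII markup shows up
    # as '<' interleaved with NUL bytes.
    if b"<\x00" in head[:4] or b"\x00<" in head[:4]:
        return "utf-16"
    # Otherwise fall back to the default assumption (utf8).
    return "utf-8"

print(sniff_encoding(b"<\x00h\x00t\x00m\x00l\x00"))  # utf-16
print(sniff_encoding(b"<html>"))                     # utf-8
```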
Here's an example doc I call aacute.html. It has 3 aacute (á) characters in the
<body> -- the first in utf8, the second as a hex entity, and the third in
latin1 (depending on the encoding your email client uses, you'll either see
these legibly or not). So this is a broken doc from an encoding perspective
(but not unlike real HTML docs out in the wild...).
Here are the bytes:
% hexdump -C aacute.html
000000d0 61 64 3e 0a 20 3c 62 6f 64 79 3e c3 a1 26 23 78 |ad>. <body>..&#x|
000000e0 45 31 3b e1 3c 2f 62 6f 64 79 3e 0a 3c 2f 68 74 |E1;.</body>.</ht|
Which says that the byte sequence for the utf8 character is:
c3 a1
while the latin1 byte is:
e1
(which is the same value as the hex entity &#xE1;).
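You can verify those byte relationships directly; a quick Python check for the character in question (U+00E1):

```python
# Verifying the byte relationships for a-acute (U+00E1):
assert "\u00e1".encode("utf-8") == b"\xc3\xa1"   # the two-byte utf8 sequence
assert "\u00e1".encode("latin-1") == b"\xe1"     # the single latin1 byte
assert ord("\u00e1") == 0xE1                     # same value as the &#xE1; entity
print("ok")
```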
The problem of course is that \xc3 and \xa1 are both valid latin1 characters
(Atilde and iexcl). I've copied them here in utf8: Ã¡
When I run it through xmllint with a variety of options, you can see libxml2
attempting to respect the declared charset if using the HTML parser (the XML
parser is much more strict and assumes utf8 if no encoding is declared). First
I try it with no explicit encoding declaration:
% xmllint --html aacute.html
and it encodes the bytes as HTML char entities -- but notice that it assumes
latin1 because I used the --html option with no explicit encoding.
So I'm guessing it defaults to latin1 if using the HTML parser with no explicit
encoding declaration.
Then with utf8 explicitly declared, with:
<META http-equiv="Content-Type" content="text/html; charset=utf-8" />
in the <head>.
% xmllint --html aacute.html
aacute.html:7: HTML parser error : Input is not proper UTF-8, indicate encoding !
So that's right: it can't decipher the last aacute because it is latin1 (the
same error is generated when using the XML parser with no encoding declared).
Then again with iso-8859-1 declared:
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
% xmllint --html aacute.html
No errors. That's because the first utf8 aacute byte sequence is valid latin1.
But notice that instead of encoding as char entities, it just outputs the raw
latin1 bytes, because libxml2 preserves the declared encoding unless you ask it
not to via the --encode option.
That latin1 output should look all garbled if your mailer uses utf8 like mine does, since it
will interpret the first 2 bytes as a single utf8 aacute, instead of 2 latin1
characters, and it will confuse the leading < in the </body> as the 2nd byte of
a utf8 character. hexdump says the bytes are:
c3 a1 e1 e1 3c
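You can reproduce the garbling from those five bytes; a Python illustration (again just my translation of what a strict utf8 decoder does):

```python
# The five bytes from the hexdump above: utf8 a-acute, two latin1
# a-acutes, then the '<' that starts the closing </body> tag.
out = b"\xc3\xa1\xe1\xe1\x3c"

# A strict utf8 decode chokes on the first \xe1: it looks like the lead
# byte of a 3-byte sequence, but what follows (\xe1, then '<') are not
# valid continuation bytes.
try:
    out.decode("utf-8")
except UnicodeDecodeError as err:
    print(err.start)  # fails at byte offset 2

# A lenient decode shows roughly what a utf8 mailer renders: a-acute,
# two replacement characters, then the '<'.
print(out.decode("utf-8", errors="replace"))
```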
If you're following this far, you can see that this gets complicated quickly.
Which is why it's still a problem with swish-e. :)
>> In The Future, it'd be nice for spider.pl to just standardize on utf-8 for
>> output to swish-e. But of course, that's when swish-e can handle utf-8. :)
> It could do that now since libxml2 will read the utf8 fine. Swish, in
> parser.c converts to 8859-1 when reading the output from libxml2.
> Again, just would need to remove any charset meta tags (or replace).
> And then I'm not sure what to do with filtered content. But, it's the
> filter's responsibility to correctly get the content decoded.
Sure, I know that swish-e handles utf-8 by downgrading it to latin1. But that's
not really useful if you've got non-latin1-compatible utf8 characters. This is
an old path we've walked before, though.
I agree about the filters needing to decode correctly. Perhaps there should be
some utility methods in the base SWISH::Filter class for encode/decode, just
wrappers around Encode or something.
For now, I think spider.pl should try and preserve any declared encoding if
possible, and if not possible, the error should prevent it from being passed on
to swish-e at all. As Marvin Humphrey likes to say, catastrophic failure is
good -- in this case, at the spider.pl level rather than later at the swish-e
level, where recovery is next to impossible.
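That fail-fast policy could look something like this sketch (in Python for illustration; the function name and signature are hypothetical, not anything in spider.pl):

```python
def normalize_to_declared(raw: bytes, declared: str) -> bytes:
    """Hypothetical sketch of the policy described above: round-trip the
    document through its declared charset strictly, and fail loudly --
    rather than pass garbage downstream -- on any mismatch."""
    try:
        text = raw.decode(declared)  # strict: no silent substitution
    except (UnicodeDecodeError, LookupError) as err:
        # Catastrophic failure at the spider level, while we still know
        # which doc and which declared charset were involved.
        raise ValueError(
            f"doc does not match declared charset {declared!r}: {err}"
        )
    return text.encode(declared)

print(normalize_to_declared(b"\xe1", "iso-8859-1"))  # round-trips cleanly
```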
Peter Karman . peter(at)not-real.peknet.com . http://peknet.com/
Users mailing list
Received on Thu Sep 13 11:09:17 2007