On Wed, Aug 29, 2007 at 04:15:04PM -0400, Matthew Cheetah Gabeler-Lee wrote:
> [ swish-e 2.4.5, debian linux package ]
> I've recently run across a problem with text encodings and spider.pl
> that seems to have been resurfacing occasionally for a good 5-6 years,
> and I think I may have a suggestion to help.
I've been on vacation so haven't had time to look at this in detail.
The spider needs to decode the fetched content (LWP does this,
actually), and then work with it as a perl string. What would be easy
is to then just encode to utf8 and send to swish-e (for libxml2 to
parse). And bytes::length() should give the length in bytes for swish
to read in.
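As a rough sketch of that idea (not what spider.pl currently does), the
following encodes a decoded Perl string to UTF-8 octets and compares the
character count with the byte count. The sample string is made up for
illustration; in the spider it would presumably be the decoded content
that LWP returns.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical decoded content: "cafe" with e-acute, a space, and U+263A.
my $content = "caf\x{e9} \x{263a}";

# Encode the character string into UTF-8 octets.
my $octets = encode('UTF-8', $content);

# length() counts characters on the decoded string and bytes on the
# encoded octet string, so the two numbers differ for non-ASCII text.
my $chars = length $content;
my $bytes = length $octets;

printf "%d characters, %d bytes\n", $chars, $bytes;
```

The byte count of the encoded string is the number swish-e would need to
be told, since that is what actually travels down the pipe.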
IIRC, the problem is that there might be a charset in a <meta> tag
that would indicate to libxml2 that it was not utf8 encoding. So, I'd
need to look at that again.
> I played around inserting some logging code into spider.pl, and it
> appears that the problem stems from perl's choice of whether or not to
> output data in utf8. The pack form is correctly determining the length
> in bytes of the utf8 encoded form of $$content, however when $$content
> is actually printed, it goes out in whatever the encoding for the STDOUT
> stream is, which is not always utf8. In fact, even if I try to set a
> locale that uses utf8, the pipe between spider.pl and swish-e is not
> utf8. Which isn't surprising, since there's no reason that pipes should
> be assumed to have character and not octet semantics.
Ah, I'm not sure I looked at what the layer might be for STDOUT. I
guess I assumed that was not an issue with a pipe. Again, something I
need to look at in more detail.
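One way to check this, as an untested sketch: list the PerlIO layers on
STDOUT, force the handle to raw octets, and print explicitly encoded
UTF-8 so the announced length and the bytes written cannot disagree.
The emit_document() helper, the example URL, and the sample text are
hypothetical, and the header lines are only meant to illustrate the
kind of record the spider hands to swish-e.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: write one document as raw UTF-8 octets and
# return the byte count announced in the header.
sub emit_document {
    my ($fh, $path, $content) = @_;
    my $octets = encode('UTF-8', $content);   # explicit UTF-8 bytes
    my $bytes  = length $octets;              # length of octets = bytes
    print {$fh} "Path-Name: $path\n",
                "Content-Length: $bytes\n\n",
                $octets;
    return $bytes;
}

# Report the layers on STDOUT via STDERR so the document stream stays clean.
print STDERR "STDOUT layers: ",
    join(' ', PerlIO::get_layers(\*STDOUT)), "\n";

# Force octet semantics so no locale or :utf8 layer re-encodes the data.
binmode(STDOUT, ':raw') or die "binmode failed: $!";

emit_document(\*STDOUT, 'http://example.com/', "na\x{ef}ve caf\x{e9}\n");
```

If the layer list shows anything that re-encodes output, that would
explain a mismatch between the computed length and what swish-e reads.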
Can you set up any test cases I could try?
Received on Mon Sep 3 00:55:47 2007