On Tue, Sep 04, 2007 at 01:15:04PM -0400, Matthew Cheetah Gabeler-Lee wrote:
> In my case, at least, the correct assumption would be that, lacking any
> encoding information, content is 8bit characters.
But there's more than one 8-bit encoding.
> In the general case, for content with no encoding specified, it
> probably should look for a utf8 BOM marker before assuming
> that it is utf8?
But a BOM is not required for utf8 or any other encoding, so we cannot
count on it being there. If the server sends no encoding, I think I
have to assume ISO-8859-1.
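A rough sketch of that fallback, in case it helps. guess_charset() is a made-up name for illustration and is not in spider.pl. It assumes it is handed the raw, undecoded bytes, and it treats a UTF-8 BOM only as a hint when one happens to be present:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of spider.pl): choose a charset for
# content that arrived without one. A leading UTF-8 BOM (bytes EF BB BF)
# is taken as a hint; otherwise fall back to ISO-8859-1.
sub guess_charset {
    my ($bytes) = @_;    # raw, undecoded octets
    return 'UTF-8' if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'ISO-8859-1';
}
```

The absence of a BOM says nothing about the encoding, which is why the second return is only a default and not a detection.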
>
> > The spider can operate on and/or alter the text of the file fetched,
> > so it must be decoded from whatever encoding is returned into Perl's
> > internal representation.
> >
> > Then the content must be re-encoded back to the original character
> > encoding and spider.pl must determine the correct length of that text
> > in bytes.
> >
> > After the content is re-encoded I don't see why we can't just use
> > length() on the string to determine the size in bytes. Do you?
>
> What happened is that, because of the high characters, the bytes length
> from perl was the number of bytes to encode the original data in utf8.
> However, when it printed the original data out to the stream, it did not
> do so in utf8, but instead faithfully reproduced the original 8bit
> ascii-ish form.
Perhaps. Which is why I think it would be better to just encode the
perl text back into the original encoding and then simply use length()
to determine the size in bytes.
> Some of this problem probably stems from the content being text/plain,
> and thus not having a good way to specify the content encoding. Swish-e
> appears to assume that, in the absence of a source encoding
> specification, the document is in perl's internal utf8-ish encoding, but
> perl is assuming the content is a stream of 8bit characters, at least
> for the purposes of printing it back out.
No, spider.pl uses HTTP::Message's decoded_content() method which defaults
to ISO-8859-1. So if the server isn't returning a charset then that's
what text/plain would default to. Or so I assume.
But, there might be a problem that spider.pl doesn't make that same
assumption when re-encoding.
It would be helpful if HTTP::Message provided a separate method to
return the charset so that HTTP::Message and spider.pl used the same
code for determining the charset. IIRC, I asked on the LWP list for
that feature.
Try changing to this (untested) code:

    for ( $response->header('content-type') ) {
        $server->{charset} = $1 if /\bcharset=([^;]+)/;
    }
    $server->{charset} ||= 'ISO-8859-1';  # add this line

    $$content = Encode::encode( $server->{charset}, $$content, Encode::FB_CROAK );

and later just do:

    my $bytecount = length $$content;
The reason for CROAK is that it shouldn't croak unless you modified
the content with characters that cannot be encoded into the specified
charset.
And just calling length on $$content should return the
correct length in bytes since the string is now encoded into a byte
string.
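Here is a small standalone sketch of the same idea, using only the core Encode module. The sample string and variable names are made up for illustration, and nothing here is taken from spider.pl. LEAVE_SRC is added only because the sketch reuses the decoded string afterwards. Without it, encode() with a CHECK value consumes its source argument.

```perl
use strict;
use warnings;
use Encode ();

# Pretend the server sent "cafe" with an e-acute, in ISO-8859-1 (4 bytes).
my $raw     = "caf\xE9";
my $charset = 'ISO-8859-1';

# Decode into Perl's internal character string.
my $text = Encode::decode( $charset, $raw );

# Re-encode to the original charset; length() now counts bytes.
my $check     = Encode::FB_CROAK | Encode::LEAVE_SRC;
my $bytes     = Encode::encode( $charset, $text, $check );
my $bytecount = length $bytes;

# The same text encoded as UTF-8 needs one more byte for the e-acute.
my $utf8_count = length Encode::encode( 'UTF-8', $text );

# FB_CROAK only fires for characters the charset cannot represent,
# e.g. a euro sign (U+20AC) appended to ISO-8859-1 content.
my $bad     = $text . "\x{20AC}";
my $ok      = eval { Encode::encode( $charset, $bad, $check ); 1 };
my $croaked = !$ok;
```

Here $bytecount is 4 and $utf8_count is 5, which is the sort of mismatch described above when the UTF-8 size is reported but the ISO-8859-1 bytes are what get written.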
--
Bill Moseley
moseley@hank.org
Received on Tue Sep 4 13:57:13 2007