Re: [swish-e] The old encoding/length problem with spider.pl

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Sep 04 2007 - 17:57:12 GMT
On Tue, Sep 04, 2007 at 01:15:04PM -0400, Matthew Cheetah Gabeler-Lee wrote:
> In my case, at least, the correct assumption would be that, lacking any 
> encoding information, content is 8bit characters.

But there's more than one 8-bit encoding.


> In the general case, for content with no encoding specified, it
> probably should look for a utf8 BOM marker before assuming
> that it is utf8?

But a BOM is not required for utf8 or any other encoding, so I cannot
count on it being there.  If there's no encoding sent by the server I
think I have to assume ISO-8859-1.
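
To make that order of preference concrete, here is a rough sketch.  The
helper name guess_charset() is made up for illustration and is not part of
spider.pl; it only assumes that an explicit charset from the server wins,
that a UTF-8 BOM is a hint when present, and that ISO-8859-1 is the
fallback:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of spider.pl: choose a charset for raw bytes.
sub guess_charset {
    my ( $content_type, $bytes ) = @_;

    # 1. Trust an explicit charset parameter sent by the server.
    return $1
        if defined $content_type
        && $content_type =~ /\bcharset="?([^";\s]+)/i;

    # 2. A UTF-8 BOM (EF BB BF) is a useful hint, but it is optional,
    #    so its absence proves nothing.
    return 'UTF-8' if $bytes =~ /^\xEF\xBB\xBF/;

    # 3. Otherwise fall back to the assumed default.
    return 'ISO-8859-1';
}

print guess_charset( 'text/html; charset=utf-8', 'abc' ),              "\n";
print guess_charset( 'text/plain',               "\xEF\xBB\xBFabc" ),  "\n";
print guess_charset( 'text/plain',               "caf\xE9" ),          "\n";
```

The last call is the case under discussion: no charset and no BOM, so the
only thing left is the default.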


> 
> > The spider can operate on and/or alter the text of the file fetched,
> > so it must be decoded from whatever encoding is returned into Perl's
> > internal representation.
> > 
> > Then the content must be re-encoded back to the original character
> > encoding and spider.pl must determine the correct length of that text
> > in bytes.
> > 
> > After the content is re-encoded I don't see why I can't just use
> > length() on the string to determine the size in bytes.  Do you?
> 
> What happened is that, because of the high characters, the bytes length 
> from perl was the number of bytes to encode the original data in utf8.  
> However, when it printed the original data out to the stream, it did not 
> do so in utf8, but instead faithfully reproduced the original 8bit 
> ascii-ish form.

Perhaps.  Which is why I think it would be better to just encode the
perl text back into the original encoding and then simply use length()
to determine the size in bytes.
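
A small stand-alone illustration of why the target encoding matters for
the byte count.  The sample string is made up, and this uses only the
core Encode module, not spider.pl itself:

```perl
use strict;
use warnings;
use Encode ();

# "cafe" with an e-acute: four characters, the last one above 0x7F.
my $bytes_in = "caf\xE9";                                   # as sent, ISO-8859-1
my $text     = Encode::decode( 'ISO-8859-1', $bytes_in );   # Perl character string

# Re-encode the same text into two different charsets.
my $latin1 = Encode::encode( 'ISO-8859-1', $text );
my $utf8   = Encode::encode( 'UTF-8',      $text );

printf "characters:       %d\n", length $text;      # 4
printf "ISO-8859-1 bytes: %d\n", length $latin1;    # 4
printf "UTF-8 bytes:      %d\n", length $utf8;      # 5
```

If the length reported is the UTF-8 figure but the bytes written are the
ISO-8859-1 ones, the two disagree by one byte per high character, which
matches the symptom described above.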

> Some of this problem probably stems from the content being text/plain, 
> and thus not having a good way to specify the content encoding.  Swish-e 
> appears to assume that, in the absence of a source encoding 
> specification, the document is in perl's internal utf8-ish encoding, but 
> perl is assuming the content is a stream of 8bit characters, at least 
> for the purposes of printing it back out.

No, spider.pl uses HTTP::Message's decoded_content() method which defaults
to ISO-8859-1.  So if the server isn't returning a charset then that's
what text/plain would default to.  Or so I assume.

But there might be a problem: spider.pl doesn't make that same
assumption when re-encoding.

It would be helpful if HTTP::Message provided a separate method to
return the charset so that HTTP::Message and spider.pl used the same
code for determining the charset.  IIRC, I asked on the LWP list for
that feature.

Try changing to this (untested) code:

    for ( $response->header('content-type') ) {
        $server->{charset} = $1 if /\bcharset=([^;]+)/;
    }

    $server->{charset} ||= 'ISO-8859-1';  # add this line

    $$content = Encode::encode( $server->{charset}, $$content, Encode::FB_CROAK );

and later just do:

    my $bytecount = length $$content;

FB_CROAK is reasonable here because encode() should croak only if you
modified the content to include characters that cannot be encoded into
the specified charset.
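
For what it's worth, here is a quick sketch of what that looks like, using
made-up strings rather than anything from spider.pl.  Encode's docs warn
that the input may be modified in place when a CHECK value is passed, so
this works on copies:

```perl
use strict;
use warnings;
use Encode ();

my $ok  = "caf\x{E9}";          # every character exists in ISO-8859-1
my $bad = "price: \x{20AC}";    # EURO SIGN has no ISO-8859-1 code point

# Work on copies, since encode() with a CHECK value may alter its input.
my $copy_ok = $ok;
my $bytes   = Encode::encode( 'ISO-8859-1', $copy_ok, Encode::FB_CROAK );
print length($bytes), " bytes\n";

# The second string cannot be represented, so encode() dies inside the eval.
my $copy_bad = $bad;
my $result   = eval {
    Encode::encode( 'ISO-8859-1', $copy_bad, Encode::FB_CROAK );
};
print "encode croaked: $@" unless defined $result;
```

The first call succeeds and yields a byte string; the second only fails
because the text contains a character the charset cannot hold.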

And just calling length on $$content should return the
correct length in bytes since the string is now encoded into a byte
string.


-- 
Bill Moseley
moseley@hank.org

Received on Tue Sep 4 13:57:13 2007