Re: [swish-e] The old encoding/length problem with

From: Bill Moseley <moseley(at)>
Date: Tue Sep 04 2007 - 16:33:24 GMT
On Tue, Sep 04, 2007 at 12:08:11PM -0400, Matthew Cheetah Gabeler-Lee wrote:
> > Can you set up any test cases I could try?
> I believe this test file (containing every char from \x01 to \xFF) 
> should work as a test case for the particular problem I hit:

Maybe I'm not understanding the problem you are having.  A file with
\x01 to \xFF is just a string of bytes, not characters.  You need to know
the encoding to map those bytes to characters.
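
To illustrate the point about bytes versus characters, here is a minimal sketch. The two-byte string is a made-up example, not the test file from the thread; it only shows that the same bytes give a different character count depending on the encoding assumed.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two raw bytes (illustrative input); without an encoding they are only numbers.
my $bytes = "\xC3\xA9";

# Decoded as ISO-8859-1, each byte maps to one character.
my $latin1 = decode( 'ISO-8859-1', $bytes );

# Decoded as UTF-8, the same two bytes form a single character (U+00E9).
my $utf8 = decode( 'UTF-8', $bytes );

print "Bytes: " . length( $bytes ) . "\n";
print "Chars as ISO-8859-1: " . length( $latin1 ) . "\n";
print "Chars as UTF-8: " . length( $utf8 ) . "\n";
```

So length() on undecoded data counts bytes, and the character count only exists once an encoding has been chosen.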

The spider can operate on and/or alter the text of the file fetched,
so the content must be decoded from whatever encoding is returned into
Perl's internal representation.

Then the content must be re-encoded back to the original character
encoding, and the spider must determine the correct length of that text
in bytes.

After the content is re-encoded I don't see why you can't just use
length() on the string to determine the size in bytes.  Do you?

moseley@bumby:~$ perl
UTF-8-demo.txt length on disk is 14052 bytes
Length of read data is 14052 bytes
Length of read text is 7621 chars
Length of encoded data is 14052 bytes

moseley@bumby:~$ cat 
use warnings;
use strict;
use Encode;

# Test length of a file with UTF-8 characters

my $file = 'UTF-8-demo.txt';

print "$file length on disk is " . (stat $file)[7] . " bytes\n";

# Read the raw bytes; there is no decoding layer on the handle
open my $fh, '<', $file or die "Cannot open $file: $!";
my $data = join '', <$fh>;
close $fh;
print "Length of read data is " . length( $data ) . " bytes\n";

# Decode the bytes into Perl's internal character representation
my $text = decode_utf8( $data );
print "Length of read text is " . length( $text ) . " chars\n";

# Re-encode to UTF-8 bytes; length() counts bytes again
$data = encode_utf8( $text );
print "Length of encoded data is " . length( $data ) . " bytes\n";
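
The same round trip (decode, alter, re-encode, then measure) can be sketched for a charset other than UTF-8. This is only an illustration under assumed inputs: the ISO-8859-1 string and the appended character are hypothetical stand-ins for fetched content and a spider filter, not anything taken from the spider itself.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: "cafe" with an e-acute, as ISO-8859-1 bytes.
my $charset = 'ISO-8859-1';
my $raw     = "caf\xE9";

# Decode to Perl's internal character representation.
my $text = decode( $charset, $raw );

# Alter the text; appending one character stands in for a filter step.
$text .= "s";

# Re-encode to the original charset; length() now counts bytes.
my $out = encode( $charset, $text );
print "Text is " . length( $text ) . " chars\n";
print "Re-encoded as $charset: " . length( $out ) . " bytes\n";

# For comparison, the same text takes a different number of bytes as UTF-8.
my $out_utf8 = encode( 'UTF-8', $text );
print "Re-encoded as UTF-8: " . length( $out_utf8 ) . " bytes\n";
```

The byte count has to be taken after re-encoding to the charset that will actually be sent, since the same text yields different sizes in different encodings.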

Bill Moseley

Received on Tue Sep 4 12:33:25 2007