Re: [swish-e] The old encoding/length problem with spider.pl

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Sep 04 2007 - 16:33:24 GMT
On Tue, Sep 04, 2007 at 12:08:11PM -0400, Matthew Cheetah Gabeler-Lee wrote:
> > Can you set up any test cases I could try?
> 
> I believe this test file (containing every char from \x01 to \xFF) 
> should work as a test case for the particular problem I hit:
> 
> http://fastcat.org/tmp/chars.txt

Maybe I'm not understanding the problem you are having.  A file with
\x01 to \xFF is just a string of bytes, not characters.  You need to
know the encoding to map those bytes to characters.

The spider can operate on and/or alter the text of the fetched file,
so the content must be decoded from whatever encoding is returned into
Perl's internal representation.

Then the content must be re-encoded back to the original character
encoding and spider.pl must determine the correct length of that text
in bytes.
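
As a rough illustration of that round trip, here is a minimal sketch.
It is not spider.pl's actual code: the byte string is a hypothetical
stand-in for a fetched document, and ISO-8859-1 is an assumed charset
chosen only so that every byte from \x01 to \xFF maps to a character.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for fetched content: the bytes \x01 .. \xFF.
my $bytes = join '', map { chr } 0x01 .. 0xFF;

# Assumption: the server declared this charset for the content.
my $charset = 'ISO-8859-1';

# Bytes -> characters (Perl's internal representation).
my $text = decode( $charset, $bytes );

# Characters -> bytes, in the original charset and in UTF-8.
my $same_charset = encode( $charset, $text );
my $as_utf8      = encode( 'UTF-8',  $text );

print "Input is "       . length( $bytes )        . " bytes\n";
print "Decoded text is " . length( $text )         . " chars\n";
print "As $charset: "    . length( $same_charset ) . " bytes\n";
print "As UTF-8: "       . length( $as_utf8 )      . " bytes\n";
```

Under those assumptions the re-encoded ISO-8859-1 string has the same
byte length as the input, while the UTF-8 version is longer because
each character from \x80 to \xFF takes two bytes.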

After the content is re-encoded I don't see why you can't just use
length() on the string to determine the size in bytes.  Do you?

moseley@bumby:~$ perl UTF-8-demo.pl
UTF-8-demo.txt length on disk is 14052 bytes
Length of read data is 14052 bytes
Length of read text is 7621 chars
Length of encoded data is 14052 bytes


moseley@bumby:~$ cat UTF-8-demo.pl 
#!/usr/bin/perl
use warnings;
use strict;
use Encode;

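# Test the length of a file containing UTF-8 characters.

my $file = 'UTF-8-demo.txt';

print "$file length on disk is " . (stat $file)[7] . " bytes\n";

# Read the raw bytes; no I/O layer is applied, so $data is undecoded.
open my $fh, '<', $file or die "Cannot open $file: $!";
my $data = join '', <$fh>;
close $fh;
print "Length of read data is " . length( $data ) . " bytes\n";

# Decode the bytes into Perl's internal character representation.
my $text = decode_utf8( $data );
print "Length of read text is " . length( $text ) . " chars\n";

# Re-encode as UTF-8; length() counts bytes again.
$data = encode_utf8( $text );
print "Length of encoded data is " . length( $data ) . " bytes\n";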


The UTF-8-demo.txt test file is from:
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt


-- 
Bill Moseley
moseley@hank.org

Received on Tue Sep 4 12:33:25 2007