On Tue, Sep 04, 2007 at 12:08:11PM -0400, Matthew Cheetah Gabeler-Lee wrote:
> > Can you set up any test cases I could try?
>
> I believe this test file (containing every char from \x01 to \xFF)
> should work as a test case for the particular problem I hit:
>
> http://fastcat.org/tmp/chars.txt
Maybe I'm not understanding the problem you are having. A file with
\x01 to \xFF is just a string of bytes, not characters. You need to know
the encoding to map those bytes to characters.
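As a rough illustration of that point (a minimal sketch, not code from spider.pl; the two-byte string is just a made-up sample), the same bytes come out as different characters depending on which encoding you assume:

```perl
#!/usr/bin/perl
use warnings;
use strict;
use Encode qw(decode);

# Two raw bytes (made-up sample data). What characters they represent
# depends entirely on the encoding assumed when decoding.
my $bytes = "\xC3\xA9";

# As ISO-8859-1 every byte is one character: U+00C3 and U+00A9.
my $as_latin1 = decode( 'ISO-8859-1', $bytes );

# As UTF-8 the two bytes form a single character: U+00E9.
my $as_utf8 = decode( 'UTF-8', $bytes );

print "Raw data is " . length( $bytes ) . " bytes\n";
print "Decoded as ISO-8859-1 it is " . length( $as_latin1 ) . " chars\n";
print "Decoded as UTF-8 it is " . length( $as_utf8 ) . " chars\n";
```

Here the raw data is 2 bytes, which decode to 2 chars as ISO-8859-1 but only 1 char as UTF-8.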
The spider can operate on and/or alter the text of the file fetched,
so it must be decoded from whatever encoding is returned into Perl's
internal representation.

Then the content must be re-encoded back to the original character
encoding, and spider.pl must determine the correct length of that text
in bytes.

After the content is re-encoded, I don't see why we can't just use
length() on the string to determine the size in bytes. Do you?
moseley@bumby:~$ perl UTF-8-demo.pl
UTF-8-demo.txt length on disk is 14052 bytes
Length of read data is 14052 bytes
Length of read text is 7621 chars
Length of encoded data is 14052 bytes
moseley@bumby:~$ cat UTF-8-demo.pl
#!/usr/bin/perl
use warnings;
use strict;
use Encode;

# Test the length of a file containing UTF-8 chars.
my $file = 'UTF-8-demo.txt';
print "$file length on disk is " . (stat $file)[7] . " bytes\n";

# Read the raw bytes; with no I/O layer, length() counts bytes.
open my $fh, '<', $file or die "Cannot open $file: $!";
my $data = join '', <$fh>;
close $fh;
print "Length of read data is " . length( $data ) . " bytes\n";

# Decode the bytes into Perl's internal character representation.
my $text = decode_utf8( $data );
print "Length of read text is " . length( $text ) . " chars\n";

# Re-encode to UTF-8; length() counts bytes again.
$data = encode_utf8( $text );
print "Length of encoded data is " . length( $data ) . " bytes\n";
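The script above round-trips the text unchanged. A variation that also alters the text before re-encoding might look like this (a sketch only: the sample string, the $charset variable, and the appended euro sign are all made up for illustration and are not what spider.pl does):

```perl
#!/usr/bin/perl
use warnings;
use strict;
use Encode qw(decode encode);

# Hypothetical fetched content: "cafe" with an e-acute, as UTF-8 bytes
# (5 bytes, 4 chars). $charset stands in for whatever the server reported.
my $charset = 'UTF-8';
my $data    = "caf\xC3\xA9";

# Decode to Perl's internal representation so the text can be altered.
my $text = decode( $charset, $data );

# Alter the text: append a euro sign (U+20AC), which is 3 bytes in UTF-8.
$text .= "\x{20AC}";

# Re-encode to the original charset; length() counts bytes again.
my $out = encode( $charset, $text );

print "Original data is " . length( $data ) . " bytes\n";
print "Altered text is " . length( $text ) . " chars\n";
print "Re-encoded data is " . length( $out ) . " bytes\n";
```

The altered text is 5 chars, but the re-encoded data is 8 bytes, and that second number from length() is the size in bytes.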
Test file: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt
--
Bill Moseley
moseley@hank.org
Received on Tue Sep 4 12:33:25 2007