On Tue, 4 Sep 2007, Bill Moseley wrote:
> Maybe I'm not understanding the problem you are having. A file with
> \x01 to \xFF is just a string of bytes, not characters. Need to know
> the encoding to map it to characters.
A text file with several "ascii" chars >127 was exactly the type of file
causing me problems. I posted a degenerate file like the above as a
worst-case example, since I couldn't quickly find the 3 offending
characters in the original (long) text document.
In my case, at least, the correct assumption would be that, lacking any
encoding information, content is 8bit characters. In the general case,
for content with no encoding specified, it should probably look for a
utf8 BOM before assuming that the content is utf8.
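
A minimal sketch of that check, written in Python rather than Perl purely for illustration. The function name guess_encoding and the Latin-1 fallback are assumptions of mine, not anything spider.pl does; "8bit characters" could just as well mean another single-byte encoding.

```python
import codecs


def guess_encoding(raw: bytes) -> str:
    """Guess an encoding for content that arrived with no charset (hypothetical helper)."""
    # A UTF-8 byte order mark (EF BB BF) is an explicit signal that the content is UTF-8.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Pure 7-bit data decodes identically under either assumption.
    if all(b < 0x80 for b in raw):
        return "ascii"
    # Assumption: anything else is single-byte 8-bit text, taken here to be Latin-1.
    return "latin-1"
```

The returned name can be passed straight to raw.decode(). The "utf-8-sig" codec strips the BOM while decoding.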
> The spider can operate on and/or alter the text of the file fetched,
> so it must be decoded from whatever encoding is returned into Perl's
> internal representation.
> Then the content must be re-encoded back to the original character
> encoding and spider.pl must determine the correct length of that text
> in bytes.
> After the content is re-encoded I don't see why we can't just use
> length() on the string to determine the size in bytes. Do you?
What happened is that, because of the high characters, the byte length
reported by perl was the number of bytes needed to encode the original
data in utf8. However, when it printed the original data out to the
stream, it did not do so in utf8, but instead faithfully reproduced the
original 8bit characters.
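
The size of the mismatch can be worked out for the degenerate \x01 to \xFF file. This is a rough illustration in Python, not the spider.pl code, and it assumes each byte is read as one Latin-1 character.

```python
# The degenerate test file: every byte value from \x01 to \xFF.
raw = bytes(range(0x01, 0x100))

# Treat each byte as one 8-bit character (Latin-1 is assumed here).
text = raw.decode("latin-1")

# Length reported if the text is measured as UTF-8:
# the 127 ASCII characters take 1 byte each, the 128 high characters take 2.
utf8_length = len(text.encode("utf-8"))

# Length of what is actually written when the original bytes go out unchanged.
emitted_length = len(raw)

print(utf8_length, emitted_length)  # 383 255
```

Every character above \x7F adds one byte to the reported length that never appears in the output, so the declared size overshoots by the number of high characters in the file.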
Some of this problem probably stems from the content being text/plain,
and thus not having a good way to specify the content encoding. Swish-e
appears to assume that, in the absence of a source encoding
specification, the document is in perl's internal utf8-ish encoding, but
perl is assuming the content is a stream of 8bit characters, at least
for the purposes of printing it back out.
"Reality is that which, when you stop believing in it, doesn't go away".
-- Philip K. Dick
GPG pubkey fingerprint: A57F B354 FD30 A502 795B 9637 3EF1 3F22 A85E 2AD1
Users mailing list
Received on Tue Sep 4 13:15:05 2007