Re: [swish-e] The old encoding/length problem with spider.pl

From: Matthew \ <cheetah-swishe(at)not-real.fastcat.org>
Date: Tue Sep 04 2007 - 17:15:04 GMT
On Tue, 4 Sep 2007, Bill Moseley wrote:

> Maybe I'm not understanding the problem you are having.  A file with
> \x01 to \xFF is just a string of bytes, not characters.  Need to know
> the encoding to map it to characters.

A text file with several "ascii" chars above 127 was exactly the type of file 
causing me problems.  I posted a degenerate file like the above as a 
worst-case example, since I couldn't quickly find the 3 
offending characters in the original (long) text document.

In my case, at least, the correct assumption would be that, lacking any 
encoding information, the content is 8-bit characters.  In the general case, 
for content with no encoding specified, it should probably look 
for a UTF-8 BOM before assuming that the content is UTF-8.
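
A rough sketch of that fallback, written in Python purely for illustration 
(spider.pl itself is Perl).  The function name guess_decode and the choice 
of ISO-8859-1 as the 8-bit fallback are assumptions made for this example, 
not something spider.pl is known to do.

```python
import codecs

def guess_decode(raw: bytes):
    """Return (text, encoding) for bytes that arrived with no charset."""
    # A UTF-8 BOM is the only positive signal this sketch checks for.
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8"), "utf-8"
    # Otherwise treat every byte as one 8-bit character (assumed Latin-1).
    return raw.decode("iso-8859-1"), "iso-8859-1"
```

Returning the encoding along with the text lets the caller re-encode with 
the same charset later, instead of guessing a second time.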

> The spider can operate on and/or alter the text of the file fetched,
> so it must be decoded from whatever encoding is returned into Perl's
> internal representation.
> 
> Then the content must be re-encoded back to the original character
> encoding and spider.pl must determine the correct length of that text
> in bytes.
> 
> After the content is re-encoded I don't see why you can't just use
> length() on the string to determine the size in bytes.  Do you?

What happened is that, because of the high characters, the byte length 
Perl reported was the number of bytes needed to encode the original data in 
UTF-8.  However, when it printed the data out to the stream, it did not 
do so in UTF-8, but instead faithfully reproduced the original 8-bit 
ascii-ish form, so the reported length was larger than what was actually 
written.
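
The mismatch can be reproduced with a few lines.  This is a sketch in Python 
rather than Perl, and the sample bytes are made up to mirror the symptom; 
only the arithmetic is the point.

```python
raw = b"na\xefve caf\xe9"          # bytes as fetched; two of them are above 0x7F
text = raw.decode("iso-8859-1")    # one character per byte

# Length if the text were written as UTF-8: each high character needs 2 bytes.
utf8_len = len(text.encode("utf-8"))

# Length of what actually goes out when the 8-bit form is reproduced as-is.
wire_len = len(text.encode("iso-8859-1"))

print(len(raw), utf8_len, wire_len)
```

Here utf8_len comes out two bytes larger than wire_len, one extra byte per 
high character.  Measuring the length of the exact bytes that get written, 
rather than of some other encoding of the same text, avoids the discrepancy.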

Some of this problem probably stems from the content being text/plain, 
and thus not having a good way to specify the content encoding.  Swish-e 
appears to assume that, in the absence of a source encoding 
specification, the document is in Perl's internal utf8-ish encoding, but 
Perl is assuming the content is a stream of 8-bit characters, at least 
for the purposes of printing it back out.

-- 
	-Cheetah
"Reality is that which, when you stop believing in it, doesn't go away".
                -- Philip K. Dick
GPG pubkey fingerprint: A57F B354 FD30 A502 795B 9637 3EF1 3F22 A85E 2AD1
Received on Tue Sep 4 13:15:05 2007