[swish-e] The old encoding/length problem with spider.pl

From: Matthew <cheetah-swishe(at)not-real.fastcat.org>
Date: Wed Aug 29 2007 - 20:15:04 GMT
[ swish-e 2.4.5, debian linux package ]

I've recently run across a problem with text encodings and spider.pl 
that seems to have been resurfacing occasionally for a good 5-6 years, 
and I think I may have a suggestion to help.

I inserted some logging code into spider.pl, and it appears that the 
problem stems from Perl's choice of whether or not to output data in 
utf8.  The pack expression correctly determines the length in bytes of 
the utf8 encoded form of $$content.  However, when $$content is 
actually printed, it goes out in whatever encoding the STDOUT stream 
uses, which is not always utf8, so the byte count and the bytes 
actually written can disagree.  In fact, even if I set a locale that 
uses utf8, the pipe between spider.pl and swish-e is not utf8.  That 
isn't surprising, since there's no reason to assume that pipes have 
character rather than octet semantics.
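To make the mismatch concrete, here is a minimal sketch.  The sample 
string is hypothetical and Encode is used only to make the encodings 
explicit; it is not what spider.pl itself does.  It compares the 
character count of a string with the number of octets it occupies in 
two different output encodings:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: "cafe" with an e-acute, one non-ASCII character.
my $content = "caf\x{e9}";

# Number of characters in the string.
my $chars = length $content;

# Number of octets if the string is written out as UTF-8.
my $utf8_bytes = length encode('UTF-8', $content);

# Number of octets if the same string is written out as Latin-1.
my $latin1_bytes = length encode('ISO-8859-1', $content);

printf "chars=%d utf8=%d latin1=%d\n", $chars, $utf8_bytes, $latin1_bytes;
# prints "chars=4 utf8=5 latin1=4"
```

If the announced length is the utf8 figure but the stream actually 
writes the Latin-1 form, the reader is told to expect one more octet 
than it receives for this string.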

I have a few possible workarounds to suggest:

1) Check the server charset ... if there is none, then we're almost 
certainly in non-utf8 mode, and should use the non-utf8 length:

About line 1422 in spider.pl, from:
my $bytecount = length pack 'C0a*', $$content;
to:
my $bytecount = $server->{charset} ? length pack 'C0a*', $$content : length $$content;

2) Check the PerlIO layers on STDOUT for utf8.  Similar to above, but 
instead use grep { $_ eq 'utf8' } PerlIO::get_layers(STDOUT) as the 
condition.  This I think would mean that swish-e would require perl >= 
5.8, which may not be desirable.

3) Set the :utf8 i/o layer on STDOUT for spider.pl, so that the data 
written is in the utf8 encoding just as the byte count is.  Again this 
would require perl >= 5.8.

For option 2, the check could be cached so that it only runs once:

our $outputisutf8;
if (!defined $outputisutf8) {
  $outputisutf8 = (grep { $_ eq 'utf8' } PerlIO::get_layers(STDOUT)) ? 1 : 0;
}
my $bytecount = $outputisutf8 ? length pack 'C0a*', $$content : length $$content;
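Option 3 could be sketched roughly as follows.  This assumes that 
$content holds a character string rather than already-encoded octets 
(otherwise the :utf8 layer would encode it a second time).  The helper 
name utf8_byte_count, the sample string and the header line are all 
made up for illustration and are not taken from spider.pl:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of spider.pl): number of octets in the
# UTF-8 encoding of a character string.
sub utf8_byte_count {
    my ($content_ref) = @_;
    return length encode('UTF-8', $$content_ref);
}

# Push the :utf8 layer so printed characters leave as UTF-8 octets.
binmode(STDOUT, ':utf8') or die "binmode failed: $!";

my $content   = "r\x{e9}sum\x{e9}";           # sample character string
my $bytecount = utf8_byte_count(\$content);   # 8 for this sample

# Illustrative header line; the real spider.pl output may differ.
print "Content-Length: $bytecount\n\n";
print $content, "\n";
```

With the layer in place, the announced count and the octets actually 
written both refer to the utf8 form, regardless of the locale.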

Finally, I would note that some of this may be due to changes in the XML 
parser or other libraries that swish-e depends on, since it was working 
fine for months until I did my periodic apt-get upgrade yesterday.  Now, 
even with the length problem above fixed, it's blowing up trying to 
parse every binary file it sees as XHTML, producing endless error 
messages.  It looks like I can filter those files out with the spider 
config, however.

-- 
	-Cheetah
"Reality is that which, when you stop believing in it, doesn't go away".
                -- Philip K. Dick
GPG pubkey fingerprint: A57F B354 FD30 A502 795B 9637 3EF1 3F22 A85E 2AD1
Received on Wed Aug 29 16:15:05 2007