[ swish-e 2.4.5, debian linux package ]
I've recently run into a problem with text encodings and spider.pl
that seems to have been resurfacing occasionally for a good 5-6 years,
and I think I may have a suggestion to help.
I added some logging code to spider.pl, and the problem appears to stem
from perl's choice of whether or not to output data in utf8. The pack
expression correctly determines the length in bytes of the utf8-encoded
form of $$content. However, when $$content is actually printed, it goes
out in whatever encoding the STDOUT stream uses, which is not always
utf8, so the byte count can disagree with the number of bytes actually
written. Even if I set a locale that uses utf8, the pipe between
spider.pl and swish-e is not utf8. That isn't surprising, since there's
no reason to assume that pipes have character rather than octet
semantics.
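To illustrate the mismatch, here is a minimal standalone sketch (not code from spider.pl; the sample string is made up). It compares the character count of a string with its size in octets under two encodings, using the core Encode module rather than the pack idiom:

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string containing one non-ASCII character (U+00E9).
my $content = "caf\x{e9}";

# Number of characters in the string.
my $chars = length $content;

# Number of octets once the string is encoded as UTF-8.
my $utf8_bytes = length encode('UTF-8', $content);

# Number of octets if the same string is written as Latin-1, which is
# what perl emits for this string when STDOUT has no encoding layer.
my $latin1_bytes = length encode('ISO-8859-1', $content);

print "characters:    $chars\n";         # 4
print "utf8 octets:   $utf8_bytes\n";    # 5
print "latin1 octets: $latin1_bytes\n";  # 4
```

If the byte count is computed from the utf8 form (5) but the data is written without a utf8 layer (4 octets), the reader on the other end of the pipe is told to expect one more byte than it receives.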
I have a few possible suggestions for workarounds:
1) Check the server charset. If there is none, then we're almost
certainly in non-utf8 mode, and should use the non-utf8 length.
At about line 1422 in spider.pl, change:
    my $bytecount = length pack 'C0a*', $$content;
to:
    my $bytecount = $server->{charset} ? length(pack 'C0a*', $$content) : length $$content;
2) Check the PerlIO layers on STDOUT for utf8. Similar to above, but
instead use grep { $_ eq 'utf8' } PerlIO::get_layers(STDOUT) as the
condition. This I think would mean that swish-e would require perl >=
5.8, which may not be desirable.
3) Set the :utf8 i/o layer on STDOUT for spider.pl, so that the data
written is in the utf8 encoding just as the byte count is. Again this
would require perl >= 5.8.
For (2), caching the check so that it only runs once, the code would
look something like this:
    our $outputisutf8;
    if (!defined $outputisutf8) {
        $outputisutf8 = (grep { $_ eq 'utf8' } PerlIO::get_layers(STDOUT)) ? 1 : 0;
    }
    my $bytecount = $outputisutf8 ? length(pack 'C0a*', $$content) : length $$content;
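Option (3) could be sketched roughly as follows. This is a standalone illustration under assumptions, not a patch against spider.pl: the sample string is made up, and the byte count is computed with Encode::encode instead of the pack idiom so that the sketch does not depend on how pack treats 'C0' in a given perl version.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Put an explicit UTF-8 encoding layer on STDOUT so that printed
# character strings go out as utf8 octets (requires perl >= 5.8).
binmode STDOUT, ':encoding(UTF-8)';

# Made-up sample content: 7 characters, two of them non-ASCII.
my $content = "na\x{ef}ve \x{263a}";

# Count the octets of the utf8-encoded form. Because STDOUT now
# encodes to UTF-8 as well, this matches what print will write.
my $bytecount = length encode('UTF-8', $content);

print "bytecount: $bytecount\n";   # 10
print $content, "\n";
```

With the layer in place, the count and the data written are both based on the same encoding, which is the point of option (3).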
Finally, I would note that some of this may be due to changes in the XML
parser or other libraries that swish-e depends on, since it was working
fine for months until I did my periodic apt-get upgrade yesterday. Now,
even with the above fixed, it's blowing up trying to parse every binary
file it sees as xhtml, producing endless error messages. It looks like
I can filter those files out with the spider config, however.
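A filter of that kind might look something like the fragment below. This is only a sketch: it assumes the test_response callback described in the spider.pl documentation, called with ($uri, $server, $response) where $response is an HTTP::Response object; the base_url and email values are placeholders. Please check the documentation for your version before relying on it.

```perl
# SwishSpiderConfig.pl (sketch; base_url and email are placeholders)
@servers = (
    {
        base_url => 'http://www.example.com/',
        email    => 'admin@example.com',

        # Assumed callback signature: ($uri, $server, $response).
        # Accept only text/* and xhtml responses, so that binary
        # files are skipped before they reach the parser.
        test_response => sub {
            my ($uri, $server, $response) = @_;
            my $type = $response->content_type;
            return $type =~ m{^(?:text/|application/xhtml\+xml)} ? 1 : 0;
        },
    },
);

1;
```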
--
-Cheetah
"Reality is that which, when you stop believing in it, doesn't go away".
-- Philip K. Dick
GPG pubkey fingerprint: A57F B354 FD30 A502 795B 9637 3EF1 3F22 A85E 2AD1
Received on Wed Aug 29 16:15:05 2007