I am using Perl 5.8.4 on a Solaris box. The spider is v1.26.
I first ran into this problem when indexing a very old internal web page
(with no doctype or charset) which contained a multi-byte character. Later on,
when indexing the first two levels of our corporate site at www.esa.int, I had
the same problem.
email@example.com wrote on 29/03/2007 17:42:49:
> On Thu, Mar 29, 2007 at 03:51:24PM +0200, Clint wrote:
> > Just to let you know, the following code:
> > my $bytecount = length($$content);
> Interesting, I would expect that not to work.
> What version of Perl are you using?
> Also, which version of the spider? There's a version number in the
> file itself.
> I recently updated the spider to deal better (I hope) with character
> encodings. So, I'm curious if there's a problem with the new code.
> I'm also curious if it's maybe your server reporting the incorrect
> character encoding.
> The spider is supposed to look at the character encoding reported by
> the web server (or in a meta tag in the web page) and decode that into
> Perl's internal character encoding. The length() function, as you
> have it above, should report the number of *characters* not bytes,
> which would not be the same if there are multi-byte characters.
> Is it possible you are indexing utf8 source but the web server is
> reporting it as an eight-bit encoding? I'm not sure if decoding utf8
> as latin1 would generate a warning.
> Bill Moseley
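For reference, here is a minimal sketch of the character-versus-byte difference described above. It is not the spider's actual code: it assumes only the core Encode module, and the sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: the raw bytes of "cafe" with an accented e, in UTF-8.
my $raw = "caf\xC3\xA9";

# Undecoded, the string is just bytes, so length() counts bytes.
my $raw_length = length($raw);                        # 5

# Decoded as UTF-8 it becomes a character string; length() counts characters.
my $as_utf8    = decode('UTF-8', $raw);
my $char_count = length($as_utf8);                    # 4

# Decoding the same bytes as Latin-1 succeeds quietly, because every byte
# is a valid Latin-1 character. The result is simply one character per byte.
my $as_latin1    = decode('ISO-8859-1', $raw);
my $latin1_count = length($as_latin1);                # 5

# To get a byte count from a decoded string, encode it again first.
my $bytecount = length(encode('UTF-8', $as_utf8));    # 5

print "raw=$raw_length chars=$char_count latin1=$latin1_count bytes=$bytecount\n";
```

So if length($$content) happens to match the byte count, one possible explanation is that the content was never decoded, or was decoded as an eight-bit encoding.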
Received on Thu Mar 29 12:11:06 2007