Re: input conversion failed

From: <moseley(at)not-real.hank.org>
Date: Mon Oct 27 2003 - 14:41:35 GMT
> > http://www.gnu.org/testimonials/testimonials.ca.html
> > > input conversion failed due to input error
> > > Bytes: 0xC4 0x3C 0x2F 0x41
> > 
> > Ok, how are you indexing?
> 
> 
> -S prog method. The prog is in perl.

If you try the way I have it below do you also get the error?

> > moseley@bumby:~$ wget
> > http://www.gnu.org/testimonials/testimonials.ca.html
> > 2>/dev/null
> > moseley@bumby:~$ swish-e -i testimonials.ca.html -v0

> Which distribution and version of linux are you using?

I tried it on two Debian Sid machines (2.4.21, libxml2 2.5.11)
and a Debian Woody machine (2.4.20, libxml2 2.4.19).

In your -S prog are you using any regular expressions on the content?
Or decoding any HTML entities?  

My before-coffee guess is that Perl is making some conversion.  I had an 
interesting problem once where I was using Perl to split up some text.
IIRC, I had HTML entities that were forcing Perl into UTF-8 mode, but 
the split I was using ended up cutting the text right in the middle of 
a multi-byte UTF-8 character, so I was ending up with broken 
characters.

  http://swish-e.org/archive/5049.html
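
As a rough illustration of that kind of breakage (in Python rather than
Perl, and with a made-up sample string, so this is only a sketch and not
the script from the earlier thread): cutting UTF-8 data at a byte offset
that lands inside a multi-byte character leaves two pieces that no
longer decode, while cutting the decoded text on a character boundary
is safe.

```python
# Hypothetical sample text; "é" is two bytes in UTF-8 (0xC3 0xA9).
text = "café au lait"
data = text.encode("utf-8")

# Cut at a byte offset that falls between the two bytes of "é".
head, tail = data[:4], data[4:]


def try_decode(chunk):
    """Return the decoded text, or None if chunk is not valid UTF-8."""
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Both halves are broken: head ends with a dangling lead byte,
# tail starts with a stray continuation byte.
print(try_decode(head), try_decode(tail))

# Splitting the decoded string instead works on whole characters,
# so each piece encodes back to valid UTF-8.
left, right = text[:4], text[4:]
print(try_decode(left.encode("utf-8")), try_decode(right.encode("utf-8")))
```

The same idea applies in any language: split on characters after
decoding, not on raw bytes.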

Is your Perl script something I can try on my machines?  Or perhaps you 
can create a small test case?  

> Let me know if you want more data points and I'll get
> them for you. For example, I can try building the
> index on a RH7.2 machine (it currently has libxml2
> 2.4.19 installed) or with another libxml2 version.

I really need to spend more time thinking about character encodings.  
For example, I'm not clear whether (or how) libxml2 can be made to 
report which encoding it has determined the source doc to be in.  It 
might be helpful to see what encoding it thinks your Perl program is 
generating (even though it says 8859-1 in the <head>).  Another 
pre-coffee thought: maybe Perl is converting something into UTF-8 but 
libxml2 is expecting 8859-1 from the charset setting.
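
One quick check that does not depend on libxml2 at all is to look at the
four bytes from the error message under each candidate encoding. The
sketch below (Python, with a hypothetical helper name decodes_as) only
shows how those bytes read as UTF-8 versus ISO-8859-1; whether this
matches what libxml2 actually did with the document is an assumption.

```python
# The bytes reported in the error: 0xC4 0x3C 0x2F 0x41
reported = bytes([0xC4, 0x3C, 0x2F, 0x41])


def decodes_as(data, encoding):
    """Return True if data is a valid byte sequence in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# 0xC4 is a UTF-8 lead byte that must be followed by a continuation
# byte (0x80-0xBF); 0x3C ("<") is not one, so UTF-8 decoding fails.
print("valid UTF-8:     ", decodes_as(reported, "utf-8"))

# In ISO-8859-1 every byte maps to a character, so this always succeeds.
print("valid ISO-8859-1:", decodes_as(reported, "iso-8859-1"))

# Read as ISO-8859-1, the bytes are a capital A with diaeresis
# followed by the start of a closing tag.
as_latin1 = reported.decode("iso-8859-1")
print(repr(as_latin1))
```

So the sequence looks like single-byte Latin-1 text being read by
something that expected UTF-8, which would be consistent with a
mismatch between the declared and actual encoding somewhere in the
pipeline.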

Please post back your findings.

Thanks,


-- 
Bill Moseley
moseley@hank.org
Received on Mon Oct 27 14:53:52 2003