Re: input conversion failed

From: J Robinson <jrobinson852(at)not-real.yahoo.com>
Date: Tue Oct 28 2003 - 14:29:32 GMT
Hello Bill and everyone,

--- moseley@hank.org wrote:
> > > > http://www.gnu.org/testimonials/testimonials.ca.html
> > > > input conversion failed due to input error
> > > > Bytes: 0xC4 0x3C 0x2F 0x41
> > > 
> > > Ok, how are you indexing?
> > 
> > -S prog method. The prog is in perl.
> 
> If you try the way I have it below do you also get
> the error?
> 
> > > moseley@bumby:~$ wget http://www.gnu.org/testimonials/testimonials.ca.html 2>/dev/null
> > > moseley@bumby:~$ swish-e -i testimonials.ca.html -v0


Interestingly, I don't get the error then (I'm using
tcsh):

[/tmp]% wget http://www.gnu.org/testimonials/testimonials.ca.html >& /dev/null
[/tmp]% ls  testimonials.ca.html 
testimonials.ca.html
[/tmp]% swish-e -i testimonials.ca.html -v0
(no output).

Same results with
http://www.openbsd.com/ko/donations.html 

> > Which distribution and version of linux are you
> using?
> 
> I tried it on two Debian Sid machines (2.4.21,
> libxml2 2.5.11)
> and a Debian Woody (2.4.20, libxml2 2.4.19).
> 
> In your -S prog are you using any regular
> expressions on the content?
> Or decoding any HTML entities?  

No, and no. It just gets the data out of a database,
wraps it in appropriate headers, and pipes it to
swish-e. Or at least I don't ask it to do any
conversions or regexes on the content! :)
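
For anyone following along, here is a minimal sketch (in Python, not
the actual perl script) of what "wraps it in appropriate headers"
means for the -S prog method. The function name swish_record and the
sample document are made up for illustration. The point it shows is
that Content-Length has to be counted in bytes after encoding, not in
characters, so the length depends on which encoding the program
really writes.

```python
def swish_record(path_name: str, html: str, encoding: str = "iso-8859-1") -> bytes:
    """Build one record for swish-e's -S prog input (hypothetical helper).

    The body is encoded first so that Content-Length is the byte count
    of exactly what gets written to the pipe.
    """
    body = html.encode(encoding)
    header = (
        f"Path-Name: {path_name}\n"
        f"Content-Length: {len(body)}\n"
        "\n"
    ).encode("ascii")
    return header + body


# Made-up sample document containing one non-ASCII character.
SAMPLE = "<html><body>\u00c4</body></html>"

latin1_record = swish_record("doc1.html", SAMPLE, "iso-8859-1")
utf8_record = swish_record("doc1.html", SAMPLE, "utf-8")
```

A real program would write each record with
sys.stdout.buffer.write(record) so that no further text-layer
conversion happens on the way to swish-e.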

I'll email you the relevant scripts offline for your
testing.

> My before-coffee-guess is that Perl is making some
> conversion.  I had an 
> interesting problem once where I was using Perl to
> split up some text.
> IIRC, I had HTML entities that were forcing Perl
> into UTF-8 mode, but 
> the split I was using ended up splitting the text
> right in the middle of 
> a multi-byte UTF-8 character.  Then I was ending up
> with broken 
> characters.
> 
>   http://swish-e.org/archive/5049.html

Sounds reasonable. Perhaps perl is doing something
'bad'. I'm using perl 5.6.1.
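
To make the failure Bill describes concrete, here is a small sketch
in Python (not perl, and not the code from either of our scripts).
The sample word and the helper is_valid_utf8 are made up for
illustration. It cuts a UTF-8 byte string in the middle of a two-byte
character and shows that neither half is valid UTF-8 on its own.

```python
def is_valid_utf8(chunk: bytes) -> bool:
    """Return True if chunk decodes cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Made-up sample: the final character is encoded as two bytes, 0xC3 0xA0.
data = "catal\u00e0".encode("utf-8")

# Split on a byte offset that lands between those two bytes.
cut = len(data) - 1
head, tail = data[:cut], data[cut:]

whole_ok = is_valid_utf8(data)   # the unsplit text is fine
head_ok = is_valid_utf8(head)    # ends with a lone lead byte
tail_ok = is_valid_utf8(tail)    # starts with a lone continuation byte
```

Splitting on character boundaries (decode first, then split) avoids
this, at the cost of having to know the encoding up front.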

> Is your Perl script something I can try on my
> machines?  Or perhaps you 
> can create a small test case?

We'll send you this offlist.
  
> > Let me know if you want more data points and I'll
> get
> > them for you. For example, I can try building the
> > index on a RH7.2 machine (it currently has libxml2
> > 2.4.19 installed) or with another libxml2 version.
> 
> I really need to spend more time thinking about
> character encodings.  
> For example, I'm not clear if/how to get libxml2 to
> say what encoding it 
> has determined the source doc to be in.  Might be
> helpful to see what 
> encoding it thinks your Perl program is generating
> (even though it says 
> 8859-1 in the <head>).  Another pre-coffee thought
> is maybe Perl is 
> converting something into utf-8 but libxml2 is
> expecting 8859-1 from the 
> charset setting.
> 
> Please post back your findings.
> 
> Thanks,
> -- 
> Bill Moseley
> moseley@hank.org
> 
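
One quick data point on the mismatch idea, as a sketch in Python. It
only looks at the four bytes from the error message and says nothing
about what libxml2 actually decided internally; the helper try_decode
is made up for illustration. The bytes decode cleanly as ISO-8859-1
but not as UTF-8, because 0xC4 opens a two-byte UTF-8 sequence and
0x3C ("<") is not a continuation byte.

```python
def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if data is invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# The bytes reported in the "input conversion failed" message.
reported = bytes([0xC4, 0x3C, 0x2F, 0x41])

as_latin1 = try_decode(reported, "iso-8859-1")  # "\u00c4</A"
as_utf8 = try_decode(reported, "utf-8")         # None: invalid sequence
```

That would be consistent with 8859-1 data being read by something
that expects UTF-8, though other explanations are possible.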

Thanks for your help debugging this, Bill.

Received on Tue Oct 28 14:41:53 2003