Re: input conversion failed

From: J Robinson <jrobinson852(at)not-real.yahoo.com>
Date: Sun Oct 26 2003 - 17:46:29 GMT
--- Bill Moseley <moseley@hank.org> wrote:
> On Sat, Oct 25, 2003 at 05:51:51PM -0700, J Robinson wrote:
> > Hello Everyone;
> > 
> > Sometimes when indexing HTML using the HTML2 backend,
> > I get messages like these from SWISH-E:
> > 
> > input conversion failed due to input error
> > Bytes: 0x25 0x00 0x61 0x3E
> 
> That's a message generated by libxml2, not by swish-e.  Code in swish-e
> causes it to print, so there should be a way to print the file.
> 
> > I know that it's multi-byte files that are causing
> > the errors. Does anyone know if there's an easy
> > workaround to avoid getting these, for example, to
> > detect that a file is multi-byte in your -S prog and
> > not index it?
> 
> I wonder if it's more a problem of libxml2 not figuring out the encoding
> correctly -- or perhaps truly an invalid sequence of bytes for the given
> encoding.  How to deal with it probably depends on what the problem is.
> 

It seems that Korean, Japanese, and other Asian pages
are especially likely to cause the error (no surprise
there). I found a publicly available example:

http://www.openbsd.com/ko/donations.html
input conversion failed due to input error
Bytes: 0xB8 0x00 0x20 0xBE

But even some 'English' pages exhibit the error:

http://www.gnu.org/testimonials/supported.html
input conversion failed due to input error
Bytes: 0xC4 0x3C 0x2F 0x41

Any ideas on the best way to detect and ignore
multi-byte content?
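
One possible check, as a rough sketch only: treat a document as indexable when it is pure 7-bit ASCII or when it decodes cleanly under an assumed encoding. The function names and the UTF-8 default below are assumptions for illustration, not anything SWISH-E or libxml2 provides, and how such a check would be wired into a -S prog program is left out. It is written in Python purely as an example.

```python
def is_ascii(data: bytes) -> bool:
    """Return True when every byte is 7-bit ASCII (no multi-byte content)."""
    return all(b < 0x80 for b in data)


def decodes_cleanly(data: bytes, encoding: str = "utf-8") -> bool:
    """Return True when data is a valid byte sequence for the given encoding.

    The default encoding is an assumption; a real filter would take it from
    the HTTP headers or the document's own declaration.
    """
    try:
        data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Invalid byte sequence, or an encoding name Python does not know.
        return False
    return True


def should_index(data: bytes, encoding: str = "utf-8") -> bool:
    """Hypothetical filter: skip documents that are neither ASCII nor valid."""
    return is_ascii(data) or decodes_cleanly(data, encoding)


if __name__ == "__main__":
    # The four bytes reported for the gnu.org page: 0xC4 starts a two-byte
    # UTF-8 sequence, but 0x3C ('<') is not a valid continuation byte.
    sample = bytes([0xC4, 0x3C, 0x2F, 0x41])
    print(should_index(sample))                        # invalid as UTF-8
    print(should_index(sample, encoding="iso-8859-1"))  # valid as Latin-1
```

The second call hints that the 'English' page may simply be Latin-1 text being read as UTF-8, in which case the problem is the encoding guess rather than the content, so skipping the file would be the wrong fix.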

Best, 
  jrobinson

[Also, it would be cool if the SWISH-E code that shows
the warning from libxml also indicated which docpath
caused the error.]


Received on Sun Oct 26 17:58:42 2003