--- Bill Moseley <moseley@hank.org> wrote:
> On Sat, Oct 25, 2003 at 05:51:51PM -0700, J Robinson wrote:
> > Hello Everyone;
> >
> > Sometimes when indexing HTML using the HTML2 backend,
> > I get messages like these from SWISH-E:
> >
> > input conversion failed due to input error
> > Bytes: 0x25 0x00 0x61 0x3E
>
> That's a message generated by libxml2, not by swish-e. Code in
> swish-e causes it to print, so there should be a way to print
> the file.
>
> > I know that it's multi-byte files that are causing
> > the errors. Does anyone know if there's an easy
> > workaround to avoid getting these, for example, to
> > detect that a file is multi-byte in your -S prog and
> > not index it?
>
> I wonder if it's more a problem of libxml2 not figuring out the
> encoding correctly -- or perhaps truly an invalid sequence of
> bytes for the given encoding. How to deal with it probably
> depends on what the problem is.
>
It seems that Korean, Japanese, and other Asian pages
are especially likely to cause the error (no surprise
there). I found some publicly available examples:

http://www.openbsd.com/ko/donations.html
input conversion failed due to input error
Bytes: 0xB8 0x00 0x20 0xBE

But even some 'English' pages exhibit the error:

http://www.gnu.org/testimonials/supported.html
input conversion failed due to input error
Bytes: 0xC4 0x3C 0x2F 0x41

Any ideas on the best way to detect and ignore
multi-byte content?
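
One possible shape for such a check, as a rough sketch only. It is
written in Python purely for illustration; the function name
looks_indexable is made up, and the heuristic (reject NUL bytes and
anything that does not decode as UTF-8) is an assumption about what
trips the converter, not something confirmed for libxml2 or SWISH-E.
A -S prog script could run a test like this on the raw bytes and
simply not emit the document when it fails:

```python
def looks_indexable(data: bytes) -> bool:
    """Crude pre-filter: True if the raw bytes look like ASCII/UTF-8 text.

    Hypothetical helper, not part of SWISH-E or libxml2.
    """
    # A NUL byte is very unusual in single-byte HTML; it usually points
    # at UTF-16 or binary content.
    if b"\x00" in data:
        return False
    # Reject anything that is not valid UTF-8 (pure ASCII passes, since
    # ASCII is a subset of UTF-8). This is a heuristic assumption and
    # will also skip pages in correctly declared legacy encodings.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # Byte sequences copied from the error messages quoted in this thread.
    samples = [
        bytes([0x25, 0x00, 0x61, 0x3E]),
        bytes([0xB8, 0x00, 0x20, 0xBE]),
        bytes([0xC4, 0x3C, 0x2F, 0x41]),
        b"<html><body>plain ASCII</body></html>",
    ]
    for sample in samples:
        print(sample, looks_indexable(sample))
```

The obvious cost is that this throws away pages that libxml2 might
have converted correctly from a declared encoding, so it only makes
sense if skipping those documents is acceptable.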

Best,
jrobinson

[Also, it would be cool if the SWISH-E code that shows
the warning from libxml2 also indicated which docpath
caused the error.]
Received on Sun Oct 26 17:58:42 2003