Hello Bill and everyone,
--- firstname.lastname@example.org wrote:
> > >
> > > > input conversion failed due to input error
> > > > Bytes: 0xC4 0x3C 0x2F 0x41
> > >
> > > Ok, how are you indexing?
> > -S prog method. The prog is in perl.
> If you try the way I have it below do you also get
> the error?
> > > moseley@bumby:~$ wget
> > >
> > > 2>/dev/null
> > > moseley@bumby:~$ swish-e -i testimonials.ca.html
Interestingly, I don't get the error then (I'm using
[/tmp]% ls testimonials.ca.html
[/tmp]% swish-e -i testimonials.ca.html -v0
Same results with
> > Which distribution and version of linux are you
> I tried it on two Debian Sid machines (2.4.21,
> libxml2 2.5.11)
> and a Debian Woody (2.4.20, libxml2 2.4.19).
> In your -S prog are you using any regular
> expressions on the content?
> Or decoding any HTML entities?
No, and no. It just gets the data out of a database,
wraps it in appropriate headers, and pipes it to
swish-e. Or at least I don't ask it to do any
conversions or regexes on the content! :)
I'll email you the relevant scripts offline for your
> My before-coffee guess is that Perl is making some
> conversion. I had an
> interesting problem once where I was using Perl to
> split up some text.
> IIRC, I had HTML entities that were forcing Perl
> into UTF-8 mode, but
> the split I was using ended up splitting the text
> right in the middle of
> a multi-byte UTF-8 character. Then I was ending up
> with broken
Sounds reasonable. Perhaps perl is doing something
'bad'. I'm using perl 5.6.1.
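To make the failure mode Bill describes concrete, here is a minimal sketch of how cutting text at a byte offset inside a multi-byte UTF-8 character leaves two invalid fragments. It is written in Python rather than Perl purely for illustration; the sample string and the helper name is_valid_utf8 are made up for this example and are not taken from my indexing script.

```python
# Hypothetical sample text: "ä" (U+00E4) is two bytes, 0xC3 0xA4, in UTF-8.
text = "Gr\u00e4fin"
encoded = text.encode("utf-8")  # b'Gr\xc3\xa4fin'

# Cut at a byte offset that falls between the two bytes of "ä".
head, tail = encoded[:3], encoded[3:]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(encoded))  # the whole string is fine
print(is_valid_utf8(head))     # ends with a lone lead byte 0xC3
print(is_valid_utf8(tail))     # starts with a stray continuation byte 0xA4
```

Whether my script actually does anything like this is still an open question; the sketch only shows why a byte-level split could yield broken characters.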
> Is your Perl script something I can try on my
> machines? Or perhaps you
> can create a small test case?
We'll send you this offlist.
> > Let me know if you want more data points and I'll
> > get them for you. For example, I can try building the
> > index on a RH7.2 machine (it currently has libxml2
> > 2.4.19 installed) or with another libxml2 version.
> I really need to spend more time thinking about
> character encodings.
> For example, I'm not clear if/how to get libxml2 to
> say what encoding it
> has determined the source doc to be in. Might be
> helpful to see what
> encoding it thinks your Perl program is generating
> (even though it says
> 8859-1 in the <head>). Another pre-coffee thought
> is maybe Perl is
> converting something into utf-8 but libxml2 is
> expecting 8859-1 from the
> charset setting.
> Please post back your findings.
> Bill Moseley
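As a quick sanity check on the encoding-mismatch idea, the four bytes from the error message can be decoded both ways. This is a small Python sketch (not Perl, and not anything swish-e or libxml2 does internally); the helper name try_decode is made up for illustration.

```python
# The bytes reported in the error: 0xC4 0x3C 0x2F 0x41
reported = bytes([0xC4, 0x3C, 0x2F, 0x41])


def try_decode(data: bytes, encoding: str):
    """Return the decoded string, or None if the bytes are invalid
    in the given encoding (illustrative helper)."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_utf8 = try_decode(reported, "utf-8")
as_latin1 = try_decode(reported, "iso-8859-1")

print(as_utf8)    # None: 0xC4 opens a 2-byte sequence, but 0x3C ('<') is not a continuation byte
print(as_latin1)  # 0xC4 is A-umlaut in 8859-1, followed by '</A'
```

So the bytes are valid as 8859-1 but not as UTF-8, which fits the guess that something is reading the document as UTF-8. It does not show which encoding libxml2 actually settled on.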
Thanks for your help debugging this, Bill.
Received on Tue Oct 28 14:41:53 2003