Just wanted to let people know that I tracked this bug
down to being related to how I handle the data before
feeding it to SWISH-E.
Apparently some of the data is not getting to SWISH-E
as intended. I think the problem is not with SWISH-E
but with a library I'm using for my upstream
processing. I'll post back a summary when I figure out
exactly what went wrong, if it seems relevant.
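
For anyone following along, here is a minimal sketch of why the four bytes in the error quoted below would trip up a UTF-8 decoder. It is written in Python purely for illustration (it is not my Perl script), it assumes the parser was treating the input as UTF-8, and the helper name `describe` is made up.

```python
# Bytes reported in the "input conversion failed" message.
bad = bytes([0xC4, 0x3C, 0x2F, 0x41])


def describe(data: bytes) -> str:
    """Say how the data decodes, trying UTF-8 first and ISO-8859-1 second."""
    try:
        return "utf-8: " + data.decode("utf-8")
    except UnicodeDecodeError:
        # 0xC4 opens a two-byte UTF-8 sequence, but 0x3C ('<') is not a
        # continuation byte, so strict UTF-8 decoding fails here.
        return "not utf-8; iso-8859-1: " + data.decode("iso-8859-1")


print(describe(bad))
```

Read as ISO-8859-1, the same bytes are an A-umlaut followed by the start of a closing tag, which would fit Latin-1 data being handed to something that expects UTF-8. That is only a guess at this point.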
--- J Robinson <email@example.com> wrote:
> Hello Bill and everyone,
> --- firstname.lastname@example.org wrote:
> > > >
> > > > > input conversion failed due to input error
> > > > > Bytes: 0xC4 0x3C 0x2F 0x41
> > > >
> > > > Ok, how are you indexing?
> > >
> > > -S prog method. The prog is in perl.
> > If you try the way I have it below do you also get
> > the error?
> > > > moseley@bumby:~$ wget
> > > >
> > > > 2>/dev/null
> > > > moseley@bumby:~$ swish-e -i
> > -v0
> Interestingly, I don't get the error then (i'm using
> [/tmp]% wget
> & /dev/null
> [/tmp]% ls testimonials.ca.html
> [/tmp]% swish-e -i testimonials.ca.html -v0
> (no output).
> Same results with
> > > Which distribution and version of linux are you
> > using?
> > I tried it on two Debian Sid machines (2.4.21,
> > libxml2 2.5.11)
> > and a Debian Woody (2.4.20, libxml2 2.4.19).
> > In your -S prog are you using any regular
> > expressions on the content?
> > Or decoding any HTML entities?
> No, and no. It just gets the data out of a database,
> wraps it in appropriate headers, and pipes it to
> swish-e. Or at least I don't ask it to do any
> conversions or regexes on the content! :)
> I'll email you the relevant scripts offline for your
> > My before-coffee guess is that Perl is making some
> > conversion. I had an
> > interesting problem once where I was using Perl to
> > split up some text.
> > IIRC, I had HTML entities that were forcing Perl
> > into UTF-8 mode, but
> > the split I was using ended up splitting the text
> > right in the middle of
> > a multi-byte UTF-8 character. Then I was ending up
> > with broken
> > characters.
> > http://swish-e.org/archive/5049.html
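
The kind of breakage described above can be reproduced in a few lines. This is a rough Python sketch rather than the Perl in question; the sample string and the helper `is_valid_utf8` are invented for the example.

```python
text = "caf\u00e9 au lait"       # the e-acute is one character
raw = text.encode("utf-8")        # ...but two bytes: 0xC3 0xA9


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Splitting at a byte offset can land inside the two-byte character,
# leaving a dangling lead byte on one side and a stray continuation
# byte on the other.
head, tail = raw[:4], raw[4:]
print(is_valid_utf8(head), is_valid_utf8(tail))

# Splitting the decoded text and encoding afterwards keeps characters whole.
safe_head = text[:4].encode("utf-8")
print(is_valid_utf8(safe_head))
```

Both byte-level halves fail to decode, while the character-level split stays valid.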
> Sounds reasonable. Perhaps perl is doing something
> 'bad'. I'm using perl 5.6.1.
> > Is your Perl script something I can try on my
> > machines? Or perhaps you
> > can create a small test case?
> We'll send you this offlist.
> > > Let me know if you want more data points and
> > get
> > > them for you. For example, I can try building
> > > index on a RH7.2 machine (it currently has
> > > 2.4.19 installed) or with another libxml2
> > I really need to spend more time thinking about
> > character encodings.
> > For example, I'm not clear if/how to get libxml2
> > say what encoding it
> > has determined the source doc to be in. Might be
> > helpful to see what
> > encoding it thinks your Perl program is generating
> > (even though it says
> > 8859-1 in the <head>). Another pre-coffee thought
> > is maybe Perl is
> > converting something into utf-8 but libxml2 is
> > expecting 8859-1 from the
> > charset setting.
> > Please post back your findings.
> > Thanks,
> > --
> > Bill Moseley
> > email@example.com
> Thanks for your help debugging this, Bill.
Received on Fri Nov 14 14:02:09 2003