At 12:41 PM 02/05/02 -0800, Rich Thomas wrote:
>We're pulling data from a mainframe database and converting it into vanilla
>html. And unless I can identify the character I can't remove it.
As Darryl pointed out, it's fairly broken HTML. Since you are
generating it from a database, you should be able to filter the data and
format the HTML correctly.
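
One way that filtering step might look, as a rough Python sketch. The names
clean_field and show_odd_bytes are made up for illustration, and the latin-1
default is only a guess at what the mainframe export uses. The first helper
reports where the odd bytes are so the offending character can be identified;
the second drops control characters and escapes the markup characters.

```python
import html
import re

# Control characters other than tab, LF and CR. Which character is actually
# causing trouble is unknown; this range is an assumption.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def show_odd_bytes(raw: bytes):
    """Return (offset, hex value) pairs for suspicious control bytes."""
    return [
        (i, hex(b))
        for i, b in enumerate(raw)
        if (b < 0x20 and b not in (9, 10, 13)) or b == 0x7F
    ]


def clean_field(raw: bytes, encoding: str = "latin-1") -> str:
    """Decode a database field, drop control characters, escape for HTML."""
    text = raw.decode(encoding, errors="replace")
    text = _CONTROL_CHARS.sub("", text)
    return html.escape(text)
```

Running show_odd_bytes over a raw record first should tell you which byte to
look for before deciding whether to drop it or map it to something else.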
I'm curious: Do you need to generate HTML at all? If you don't have a huge
amount of traffic, would it be possible to generate the HTML dynamically
for viewing? Then, for indexing with swish, use a -S prog program to extract
the data from the database and send it directly to swish.
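
A minimal sketch of such a -S prog program, in Python. fetch_records is a
hypothetical stand-in for the real database query, and the record paths are
made up. It assumes the usual prog format: each document is a Path-Name
header and a Content-Length header (the body size in bytes), then a blank
line, then the content. Check the swish documentation for the other headers.

```python
import sys


def fetch_records():
    """Hypothetical stand-in for the database query; yields (path, text)."""
    yield "record/1", "<html><head><title>First</title></head><body>one</body></html>"
    yield "record/2", "<html><head><title>Second</title></head><body>two</body></html>"


def format_document(path, text):
    """Build one document for swish: headers, blank line, then the body."""
    body = text.encode("utf-8")
    # Content-Length must count bytes, not characters.
    header = f"Path-Name: {path}\nContent-Length: {len(body)}\n\n"
    return header.encode("utf-8") + body


def main(out=None):
    # Write bytes so the length header stays accurate on every platform.
    out = out if out is not None else sys.stdout.buffer
    for path, text in fetch_records():
        out.write(format_document(path, text))


if __name__ == "__main__":
    main()
```

Saved as, say, dbfeed.py (a made-up name), it would be run with something
like "swish-e -S prog -i ./dbfeed.py".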
This isn't much help, but when I first added libxml2 to swish I had a
number of problems with libxml2 hanging (that's what's happening in this
case). That shouldn't happen even with really bad input, so you probably
could call it a libxml2 bug.
Again, the best solution is to fix up your HTML, but I'll also try to find
time to write a test program with libxml2, post it to the libxml list,
and duck for cover when they see the input source ;)
Received on Wed Feb 6 04:03:51 2002