Re: HTML2 problem again

From: Bill Moseley <moseley(at)>
Date: Wed Feb 06 2002 - 04:03:10 GMT
At 12:41 PM 02/05/02 -0800, Rich Thomas wrote:
>We're pulling data from a mainframe database and converting it into vanilla
>html.  And unless I can identify the character I can't remove it.

Hi Rich,

As Darryl pointed out, the HTML is fairly badly broken.  Since you are
generating it from a database, you should be able to filter the data and
format the HTML correctly.
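
For what it's worth, here is a rough sketch of the kind of filter I mean.
It assumes the offending character is a non-printable control character,
which is only a guess since we have not identified it yet.  The function
names are made up for illustration: find_suspects() reports where the odd
characters sit so you can identify them, and clean_field() drops them and
escapes the HTML metacharacters before the field goes into the page.

```python
import html
import unicodedata

ALLOWED_CONTROLS = "\t\n\r"  # whitespace controls that are harmless in HTML


def is_suspect(ch):
    """True for control characters other than tab, newline and CR."""
    return unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS


def find_suspects(text):
    """Return (offset, code point) pairs for every suspect character."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if is_suspect(ch)]


def clean_field(text):
    """Drop suspect characters, then escape &, <, > and quotes."""
    kept = "".join(ch for ch in text if not is_suspect(ch))
    return html.escape(kept)
```

Running find_suspects() over one of the bad records should at least tell
you which code point you are dealing with.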

I'm curious: Do you need to generate HTML at all?  If you don't have a huge
amount of traffic, would it be possible to generate the HTML dynamically
for viewing?  Then, for indexing with swish, use a -S prog program to
extract the data from the database and send it directly to swish.
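
A minimal sketch of such a -S prog program follows.  It assumes the header
layout described in the swish-e documentation for the prog input method
(Path-Name, Content-Length in bytes, Last-Mtime, an optional Document-Type,
then a blank line and the document), so check it against the docs for your
version.  fetch_records() is a hypothetical stand-in for your real database
query, and the record paths are invented.

```python
import sys
import time


def format_document(path, html_body, mtime):
    """Build one document in the header + blank line + body layout."""
    body = html_body.encode("utf-8")
    header = (
        f"Path-Name: {path}\n"
        f"Content-Length: {len(body)}\n"  # length in bytes, not characters
        f"Last-Mtime: {int(mtime)}\n"
        "Document-Type: HTML*\n"          # optional; assumed parser name
        "\n"
    ).encode("utf-8")
    return header + body


def fetch_records():
    """Hypothetical placeholder: yield (path, html, mtime) from the database."""
    yield ("record/1", "<html><body>First record</body></html>", time.time())
    yield ("record/2", "<html><body>Second record</body></html>", time.time())


if __name__ == "__main__":
    out = sys.stdout.buffer  # write bytes so Content-Length stays accurate
    for path, html_body, mtime in fetch_records():
        out.write(format_document(path, html_body, mtime))
```

If this were saved as db_to_swish.py (a made-up name) and made executable,
you would index with something like "swish-e -S prog -i ./db_to_swish.py".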

This isn't much help, but when I first added libxml2 to swish I had a
number of problems with libxml2 hanging (which is what's happening in this
case).  That shouldn't happen even with really bad input, so you could
probably call it a libxml2 bug.

Again the best solution is to fix up your HTML, but I'll also try to find
time to write a test program with libxml2 and post it to the libxml list
and duck for cover when they see the input source ;)

Bill Moseley
Received on Wed Feb 6 04:03:51 2002