Re: AW: Re: AW: Re: Getting a description out of the

From: Bill Moseley <moseley(at)>
Date: Wed Mar 27 2002 - 19:18:32 GMT
At 10:30 AM 03/27/02 -0800, Markus Strickler wrote:
>Ah, the file is not valid HTML.
>OK, I'll go and bash our HTML coders for a while... ;-)
>So the file must be valid for the HTML2 parser to work....
>And it must be well-formed for the XML parser to work.
>But the HTML parser ignores all this, correct?

The libxml2 (HTML2) parser is better.  The HTML parser may continue right
along without complaining about bad HTML, but end up indexing the wrong words.

The HTML2 parser finds the mistakes and does the best it can, and often
fixes structure (like adding opening/closing tags to some degree).  In your
case, when it hit <form> it added its own body tag, since <form> must be
inside of <body>.

Frankly, I think it's incorrect for libxml2 to have returned the attributes
as plain text.  But, I would get eaten alive if I posted to the libxml2
list that libxml2 is *incorrectly* parsing *bad* HTML.

The HTML parser works well for most documents.  But it does get confused.

It is a very good exercise to take a few thousand HTML docs, index them with
both HTML and HTML2, and diff the words.

Here are just a few files, using HTML that's been generated by a program (not
hand created):

~/swish-e/src > echo "DefaultContents HTML" > c

~/swish-e/src > ./swish-e -c c -i ../html/*.html -f index_HTML -v0    
Indexing Data Source: "File-System"
Indexing done!

~/swish-e/src > echo "DefaultContents HTML2" > c

~/swish-e/src > ./swish-e -c c -i ../html/*.html -f index_HTML2 -v0
Indexing Data Source: "File-System"
Indexing done!

~/swish-e/src > ./swish-e -f index_HTML -T index_words_only > words_HTML
~/swish-e/src > ./swish-e -f index_HTML2 -T index_words_only > words_HTML2

~/swish-e/src > diff words_HTML words_HTML2
> anding 

That's in two documents like this:


The HTML parser indexed that as "and" and "ing".  The HTML2 parser indexed it
as one word.

< eb

Again, HTML indexed it as "w" and "eb".


> humans


< imple
< ndexing
< nhanced
> notecurrently  

Ah, look, the HTML parser actually "fixed" my mistake in

<STRONG>NOTE</STRONG>Currently swish will exit

Sometimes HTML is better ;)
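
To make that difference concrete, here is a rough shell sketch.  It is not
how either parser is implemented; it only mimics the two outcomes with sed.
The function names tags_as_breaks and tags_removed are made up for this
example.  The first treats every tag as a word break, which is roughly what
the HTML output above looks like; the second drops tags without adding a
break, which gives the joined word that HTML2 reported.

```shell
# Hypothetical helpers for illustration only; not part of swish-e or libxml2.

# Replace every tag with a space, so text on either side becomes separate words.
tags_as_breaks() { sed 's/<[^>]*>/ /g'; }

# Delete every tag outright, so text on either side is joined together.
tags_removed() { sed 's/<[^>]*>//g'; }

sample='<STRONG>NOTE</STRONG>Currently swish will exit'

printf '%s\n' "$sample" | tags_as_breaks   # "NOTE" and "Currently" stay apart
printf '%s\n' "$sample" | tags_removed     # "NOTECurrently" runs together
```

Lowercase the second result and you get the "notecurrently" word from the
diff.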

> oring
> quot

Hum, why is libxml2 indexing an HTML entity as plain text?

Yep, sure enough - 

Line 105, column 33: 
  ...    <pre>    ReplaceRules append &quot;foo bar&quot;   &lt;-  ...
                                   ^Error: unknown entity "quot"

< umans
< ystem
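
With a few thousand docs the raw diff gets long, so it can help to split it
into "only in HTML" and "only in HTML2" lists.  Below is a small sketch using
sort and comm.  The only_in function is a made-up helper, and it assumes each
word file has one word per line, which may not match the exact dump format.

```shell
# Hypothetical helper, not part of swish-e.
# usage: only_in FIRST SECOND
# Prints the words found in FIRST that are absent from SECOND.
# Assumes one word per line in each file.
only_in() {
    # comm needs sorted input; use the C locale for both sort and comm
    # so that they agree on ordering.
    LC_ALL=C sort -u "$1" > "$1.sorted"
    LC_ALL=C sort -u "$2" > "$2.sorted"
    LC_ALL=C comm -23 "$1.sorted" "$2.sorted"
    rm -f "$1.sorted" "$2.sorted"
}
```

Then "only_in words_HTML words_HTML2" lists what only the HTML parser
produced, and swapping the arguments lists what only HTML2 produced.
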
Bill Moseley
Received on Wed Mar 27 19:18:34 2002