Re: AW: Re: AW: Re: Getting a description out of the

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Mar 27 2002 - 19:18:32 GMT
At 10:30 AM 03/27/02 -0800, Markus Strickler wrote:
>Ah, the file is not valid HTML.
>OK, I'll go and bash our HTML coders for a while... ;-)
>
>So the file must be valid for the HTML2 parser to work....
>And it must be well-formed for the XML parser to work.
>But the HTML parser ignores all this, correct?

The libxml2 (HTML2) parser is better.  The HTML parser may continue right
along without complaining about bad HTML, but index the wrong words.

The HTML2 parser finds the mistakes and does the best it can, and often
fixes structure (like adding opening/closing tags to some degree).  In your
case, when it hit <form> it added its own body tag, since <form> must be
inside of <body>.

Frankly, I think it's incorrect for libxml2 to have returned the attributes
as plain text.  But, I would get eaten alive if I posted to the libxml2
list that libxml2 is *incorrectly* parsing *bad* HTML.

The HTML parser works well for most documents.  But it does get confused.

It is a very good exercise to take a few thousand HTML docs, and index with
both HTML and with HTML2 and diff the words:

Here are just a few files, using HTML that was generated by a program (not
created by hand):

~/swish-e/src > echo "DefaultContents HTML" > c

~/swish-e/src > ./swish-e -c c -i ../html/*.html -f index_HTML -v0    
Indexing Data Source: "File-System"
Indexing done!

~/swish-e/src > echo "DefaultContents HTML2" > c

~/swish-e/src > ./swish-e -c c -i ../html/*.html -f index_HTML2 -v0
Indexing Data Source: "File-System"
Indexing done!

~/swish-e/src > ./swish-e -f index_HTML -T index_words_only > words_HTML
~/swish-e/src > ./swish-e -f index_HTML2 -T index_words_only > words_HTML2

> diff words_HTML words_HTML2   
192a193
> anding 

That's in two documents like this:

    <STRONG>and</STRONG>ing

The HTML parser indexed it as "and" and "ing".  The HTML2 parser indexed it
as one word.
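
To make the difference concrete, here is a small sketch of the two
behaviours.  It is only an illustration of the effect seen in the diff, not
how either swish-e parser is actually implemented; the function names
tags_as_breaks and tags_dropped are made up for this example.

```shell
#!/bin/sh
# Illustration only: NOT the real swish-e parser code.

# Treat every tag as a word break: inline markup splits the word.
tags_as_breaks() {
    printf '%s\n' "$1" \
        | sed 's/<[^>]*>/ /g' \
        | tr '[:upper:]' '[:lower:]' \
        | tr -s ' ' '\n' \
        | sed '/^$/d'
}

# Drop tags without adding a break: text on both sides joins up.
tags_dropped() {
    printf '%s\n' "$1" \
        | sed 's/<[^>]*>//g' \
        | tr '[:upper:]' '[:lower:]' \
        | tr -s ' ' '\n' \
        | sed '/^$/d'
}

tags_as_breaks '<STRONG>and</STRONG>ing'   # prints "and" and "ing" on separate lines
tags_dropped   '<STRONG>and</STRONG>ing'   # prints "anding"
```

The first function gives the kind of split the HTML parser produced here, and
the second gives the joined word that HTML2 indexed.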


702d702
< eb

Again, HTML indexed it as "w" and "eb".

<STRONG>W</STRONG>eb


1058a1059
> humans

<STRONG>H</STRONG>umans


1085d1085
< imple
1440d1439
< ndexing
1457d1455
< nhanced
1475a1474
> notecurrently  

Ah, look: the HTML parser actually "fixed" my mistake in
./html/SWISH-PERL.html:

<STRONG>NOTE</STRONG>Currently swish will exit

Sometimes HTML is better ;)
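
One rough way to hunt for this kind of markup in the source files is to grep
for a closing tag that is followed directly by a letter.  This is a sketch
only: find_split_words is a made-up helper name, and the pattern will also
match cases that are intentional (like the "W" in "Web").

```shell
#!/bin/sh
# find_split_words FILE...
# Print lines where a closing tag is immediately followed by a letter,
# e.g. "</STRONG>Currently" or "</STRONG>ing".
find_split_words() {
    grep -n -E '</[A-Za-z][A-Za-z0-9]*>[A-Za-z]' "$@"
}

# Example (path taken from the indexing commands above):
#   find_split_words ../html/*.html
```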

1543a1543
> oring
1789a1790
> quot

Hum, why is libxml2 indexing an HTML entity as plain text?

Yep, sure enough - 

Line 105, column 33: 
  ...    <pre>    ReplaceRules append &quot;foo bar&quot;   &lt;-  ...
                                   ^Error: unknown entity "quot"



2351d2351
< umans
2526d2525
< ystem
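
If the raw diff is hard to read across a few thousand documents, the two word
lists can be split into "only in HTML" and "only in HTML2" with comm.  This
is a sketch that assumes the word files hold one word per line; only_in_first
is a made-up helper name, and trailing whitespace on a line would make
otherwise equal words compare as different.

```shell
#!/bin/sh
# only_in_first A B
# Print the lines that appear in file A but not in file B.
# comm needs sorted input, so both files are sorted into temp files first.
only_in_first() {
    a=$(mktemp)
    b=$(mktemp)
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Example, using the word lists written earlier:
#   only_in_first words_HTML  words_HTML2   # fragments such as "eb", "umans"
#   only_in_first words_HTML2 words_HTML    # joined words such as "anding"
```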
-- 
Bill Moseley
mailto:moseley@hank.org
Received on Wed Mar 27 19:18:34 2002