From: Bill Moseley <moseley(at)>
Date: Tue Feb 03 2004 - 15:25:50 GMT

On Mon, Feb 02, 2004 at 10:02:38PM -0800, Peter Karman wrote:

> The difference seems to be that the XML2 version splits words on tags,
> while the HTML2 parser does not.

That might be true in some cases.  It's been discussed on the list 
before how to deal with 


is that one or two words?
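
To make the one-or-two-words question concrete, here is a rough sketch 
in C.  It is not Swish-e code: count_words() and its split_on_tags 
switch are made up for illustration, and the input in the note below is 
only an example.  It counts words while skipping markup, and either 
treats a tag as a word boundary or drops it so the letters on both 
sides join.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Count words in `markup`, skipping everything between '<' and '>'.
 * If split_on_tags is nonzero, a tag ends the current word; otherwise
 * the tag is dropped and the letters on both sides join into one word.
 * Hypothetical helper for illustration, not taken from Swish-e. */
size_t count_words(const char *markup, int split_on_tags)
{
    size_t words = 0;
    int in_tag = 0;   /* currently between '<' and '>' */
    int in_word = 0;  /* currently inside a run of word characters */

    assert(markup != NULL);

    for (const char *p = markup; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;

        if (in_tag) {
            if (c == '>')
                in_tag = 0;
            continue;
        }
        if (c == '<') {
            in_tag = 1;
            if (split_on_tags)
                in_word = 0;  /* the tag acts as a word boundary */
            continue;
        }
        if (isalnum(c)) {
            if (!in_word) {
                in_word = 1;
                words++;
            }
        } else {
            in_word = 0;      /* whitespace or punctuation ends a word */
        }
    }
    return words;
}
```

With an input such as "foo<b>bar</b>", count_words(s, 1) gives 2 and 
count_words(s, 0) gives 1, which is the kind of difference a search for 
the joined word would trip over.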

> -h[option]

Someone still uses -h?

> and the files have been indexed with XML2, they won't get a hit. 
> But if the files have been indexed with HTML2, they do.

Right.  There's special handling done for HTML files.  I'm not exactly 
sure how libxml2 parses HTML differently -- it likely allows for the way 
HTML is written, and it does provide the htmlTagLookup() function.
Swish has special handling of <head>, <title>, <body>, <h\d>, <em>, <b>, 
and <strong> -- that's what sets the "structure" flags in the document.  
"structure" is also used in ranking.  There's also special handling of 
href and src in <a> and <img>, IIRC -- you can index links.

> I guess my question is: should the HTML and XML versions really act so 
> differently?

Yes and no.  Swish-e has a lot of history indexing HTML documents.  But 
it would be nice to have a more general approach where you can list tags 
that are "special" -- like <title> should be ranked much higher, and 
<title> should be indexed with the <body>.
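
As a rough idea of what a table-driven version of that could look like, 
here is a sketch in C.  Everything in it is an assumption for 
illustration: the ST_* names and bit values, the special_tags[] table, 
and structure_flag_for_tag() are not the names used in the Swish-e 
source.  The tag list is the one mentioned above (<head>, <title>, 
<body>, <h\d>, <em>, <b>, <strong>).

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical structure bits; names and values are illustrative only. */
enum {
    ST_HEAD     = 1 << 0,
    ST_TITLE    = 1 << 1,
    ST_BODY     = 1 << 2,
    ST_HEADING  = 1 << 3,  /* <h1> .. <h9>, i.e. <h\d> */
    ST_EMPHASIS = 1 << 4   /* <em>, <b>, <strong> */
};

struct special_tag {
    const char *name;  /* lower-case tag name */
    int flag;          /* structure bit set while inside the tag */
};

/* The list of "special" tags lives in one table, so it could be
 * extended or loaded from configuration. */
static const struct special_tag special_tags[] = {
    { "head",   ST_HEAD },
    { "title",  ST_TITLE },
    { "body",   ST_BODY },
    { "em",     ST_EMPHASIS },
    { "b",      ST_EMPHASIS },
    { "strong", ST_EMPHASIS },
};

/* Case-insensitive comparison, since HTML tag names are not
 * case sensitive. Returns 1 when the names match. */
static int tag_equal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;  /* both strings must end together */
}

/* Return the structure bit for `tag`, or 0 if the tag is not special. */
int structure_flag_for_tag(const char *tag)
{
    if (tag == NULL)
        return 0;

    /* <h\d>: the letter 'h' followed by exactly one digit */
    if ((tag[0] == 'h' || tag[0] == 'H')
        && isdigit((unsigned char)tag[1]) && tag[2] == '\0')
        return ST_HEADING;

    for (size_t i = 0; i < sizeof special_tags / sizeof special_tags[0]; i++) {
        if (tag_equal(tag, special_tags[i].name))
            return special_tags[i].flag;
    }
    return 0;
}
```

A parser callback could OR the returned bit into a per-document 
structure value on a start tag and clear it on the end tag.  Ranking 
weights could then hang off the same table rather than being hard-coded 
per tag.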

> I also found this gem:
> which leads me to believe that Bill has dealt with this already and has 
> something authoritative to say. ;)

Notice that nobody replied to that article?  I have a good chunk of 
those unanswered questions hanging around the Internet.

> I looked at parser.c and it looks like there are two different functions 
> called, one each for HTML and XML (htmlCreatePushParserCtxt and 
> xmlCreatePushParserCtxt) -- does this mean the issue is with libxml2 and 
> I should just suck it up and use some kind of preprocessor to strip out 
> the inline tags? I am using libxml2 2.6.4.

No, did you look at the code in check_html_tag()?

Bill Moseley
Received on Tue Feb 3 07:26:05 2004