On Mon, Feb 02, 2004 at 10:02:38PM -0800, Peter Karman wrote:
> The difference seems to be that the XML2 version splits words on tags,
> while the HTML2 parser does not.
That might be true in some cases. It's been discussed on the list
before how to deal with
is that one or two words?
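To make the question concrete, here is a minimal sketch of the two tokenizing policies. It is not Swish-e or libxml2 code: the `words()` helper, the regular expressions, and the sample string are all made up for illustration.

```python
import re

TAG = re.compile(r"<[^>]+>")   # crude tag matcher, good enough for a sketch
WORD = re.compile(r"\w+")

def words(markup, tags_split_words):
    """Return the lowercased words found in markup.

    tags_split_words=True  : every tag acts as a word boundary.
    tags_split_words=False : tags are dropped, so text on both sides is joined.
    """
    text = TAG.sub(" " if tags_split_words else "", markup)
    return WORD.findall(text.lower())

sample = "foo<b>bar</b> baz"   # hypothetical input, not from the thread
print(words(sample, True))     # ['foo', 'bar', 'baz']
print(words(sample, False))    # ['foobar', 'baz']
```

A search for "foobar" only matches under the second policy, which is the kind of difference being described here.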
Someone still uses -h?
> and the files have been indexed with XML2, they won't get a hit.
> But if the files have been indexed with HTML2, they do.
Right. There's special handling done for HTML files. I'm not exactly
sure how libxml2 parses HTML differently -- it likely allows for the way
HTML is written, and it does provide the htmlTagLookup() function.
Swish has special handling of <head>, <title>, <body>, <h\d>, <em>, <b>,
and <strong> -- that's what sets the "structure" flags in the document.
"structure" is also used in ranking. There's also special handling of
href and src in <a> and <img>, IIRC -- you can index links.
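As a rough illustration of the idea (not the actual code in parser.c), the sketch below records a bitmask of "structure" flags for each word based on the tags that enclose it. The flag names and values, the `StructureTagger` class, and `tag_words()` are assumptions invented for this example, and it uses Python's standard `html.parser` rather than libxml2.

```python
import re
from html.parser import HTMLParser

# Hypothetical flag values; Swish-e's real constants are not shown in this thread.
IN_HEAD, IN_TITLE, IN_BODY, IN_HEADING, IN_EMPHASIS = 1, 2, 4, 8, 16

def flag_for(tag):
    """Map a tag name to its structure flag, or 0 if the tag is not special."""
    if tag == "head":
        return IN_HEAD
    if tag == "title":
        return IN_TITLE
    if tag == "body":
        return IN_BODY
    if re.fullmatch(r"h[1-6]", tag):
        return IN_HEADING
    if tag in ("em", "b", "strong"):
        return IN_EMPHASIS
    return 0

class StructureTagger(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = {}   # flag -> number of currently open tags carrying it
        self.words = []   # list of (word, structure bits)

    def handle_starttag(self, tag, attrs):
        flag = flag_for(tag)
        if flag:
            self.depth[flag] = self.depth.get(flag, 0) + 1

    def handle_endtag(self, tag):
        flag = flag_for(tag)
        if flag and self.depth.get(flag, 0) > 0:
            self.depth[flag] -= 1

    def handle_data(self, data):
        # OR together the flags of every special tag that is currently open.
        bits = 0
        for flag, count in self.depth.items():
            if count > 0:
                bits |= flag
        for word in re.findall(r"\w+", data.lower()):
            self.words.append((word, bits))

def tag_words(html):
    parser = StructureTagger()
    parser.feed(html)
    parser.close()
    return parser.words
```

A ranking step could then test bits such as `IN_TITLE` or `IN_HEADING` on each hit to decide how much the hit should count.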
> I guess my question is: should the HTML and XML versions really act so
Yes and no. Swish-e has a lot of history indexing HTML documents. But
it would be nice to have a more general approach where you can list tags
that are "special" -- like <title> should be ranked much higher, and
<title> should be indexed with the <body>.
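One way such a list might look, purely as a sketch: a table of tag names and rank multipliers that works the same for HTML and XML input. Swish-e has no such option that I am describing here; `SPECIAL_TAGS`, its weights, and `word_weight()` are hypothetical.

```python
# Hypothetical configuration: tag name -> rank multiplier.
SPECIAL_TAGS = {"title": 5.0, "h1": 3.0, "em": 1.5}

def word_weight(enclosing_tags, special=SPECIAL_TAGS):
    """Return the rank multiplier for a word given the tags that enclose it.

    The strongest boost among the enclosing tags wins; tags that are not
    listed as special leave the weight at 1.0.
    """
    return max([special.get(tag, 1.0) for tag in enclosing_tags] + [1.0])

print(word_weight(["html", "head", "title"]))  # 5.0
print(word_weight(["html", "body", "p"]))      # 1.0
```

Taking the maximum rather than multiplying keeps deeply nested markup from inflating the score, but that is a design choice of this sketch and nothing more.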
> I also found this gem:
> which leads me to believe that Bill has dealt with this already and has
> something authoritative to say. ;)
Notice that nobody replied to that article? I have a good chunk of
those unanswered questions hanging around the Internet.
> I looked at parser.c and it looks like there are two different functions
> called, one each for HTML and XML (htmlCreatePushParserCtxt and
> xmlCreatePushParserCtxt) -- does this mean the issue is with libxml2 and
> I should just suck it up and use some kind of preprocessor to strip out
> the inline tags? I am using libxml2 2.6.4.
No, did you look at the code in check_html_tag()?
As for the rest of your question... you will have to wait. My wife says
I have to make the coffee.
Received on Tue Feb 3 07:26:05 2004