I found several threads on word position in the archives, but none
specifically on HTML block tags. This is a follow-up question to my
questions last week on the difference between using the HTML2 and XML2
parsers.
It appears that the HTML2 parser currently does not increment the word
position each time it reaches an HTML block tag. libxml2 defines
block tags as:
pre
p
div
dl
center
blockquote
etc. Also, all the h\d heading tags are included in that definition.
My question is: should phrase matching really work across something like
a <p> tag? Or across an <h\d> tag?
For example:
===============
<body>
<h1>some title</h1>
<p>some text</p>
</body>
===============
a phrase search for "title some" will match.
I realize that HTML is mostly tagged for what it should /look/ like and
not what it means, but this seems counterintuitive to me. I realize that
there are various config options to control some of the bumping features
(BumpPositionCounterCharacters, etc.), but these seem to ignore HTML tags
(which I assume, from staring at parser.c, are parsed prior to the
evaluation of the Bump).
In looking at the parser.c code, I see that it seems to be possible to
implement something like a BumpPositionCounteronHTMLBlocks (NO|yes)
config option or something like that, but before I jumped in and tried
to hack that bit, I wanted to throw it out there and see if there is some
piece of logic that I'm missing.
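
To make the idea concrete, here is a rough sketch of the behavior I have
in mind. It is Python using the standard html.parser module, not the
actual parser.c / libxml2 code, and every name in it (PositionIndexer,
bump_on_blocks, phrase_matches, index_html) is made up for illustration.
The block tag set is only the partial list quoted above plus h1-h6. With
the flag off, words get consecutive positions regardless of tags; with it
on, each block open or close tag adds one to the position counter.

```python
import re
from html.parser import HTMLParser

# Partial block tag list from the post, plus the h\d headings (assumption:
# the real libxml2 list is longer).
BLOCK_TAGS = {"pre", "p", "div", "dl", "center", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class PositionIndexer(HTMLParser):
    """Hypothetical indexer: records (word, position) pairs."""

    def __init__(self, bump_on_blocks=False):
        super().__init__()
        self.bump_on_blocks = bump_on_blocks  # stand-in for the proposed option
        self.position = 0
        self.words = []

    def _bump(self, tag):
        # Leave a gap in the position sequence at block boundaries.
        if self.bump_on_blocks and tag in BLOCK_TAGS:
            self.position += 1

    def handle_starttag(self, tag, attrs):
        self._bump(tag)

    def handle_endtag(self, tag):
        self._bump(tag)

    def handle_data(self, data):
        for word in re.findall(r"\w+", data.lower()):
            self.position += 1
            self.words.append((word, self.position))


def index_html(html, bump_on_blocks=False):
    parser = PositionIndexer(bump_on_blocks)
    parser.feed(html)
    parser.close()
    return parser.words


def phrase_matches(words, phrase):
    """True if the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    positions = {}
    for word, pos in words:
        positions.setdefault(word, set()).add(pos)
    return any(
        all(start + i in positions.get(term, ()) for i, term in enumerate(terms))
        for start in positions.get(terms[0], ())
    )


doc = "<body>\n<h1>some title</h1>\n<p>some text</p>\n</body>"
print(phrase_matches(index_html(doc), "title some"))
print(phrase_matches(index_html(doc, bump_on_blocks=True), "title some"))
print(phrase_matches(index_html(doc, bump_on_blocks=True), "some title"))
```

In this sketch "title some" matches with the flag off and stops matching
with it on, while "some title" (inside the same <h1>) still matches
either way.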
Anyone?
thanks.
pek
--
Peter Karman - Software Publications Engineer - Cray Inc
phone: 651-605-9009 - mailto:karman@cray.com
Received on Mon Feb 9 07:46:09 2004