Forgive me, please, if someone else has pointed this out or if this is a
known issue. I don't recall seeing this in the archive. This is also a
bit long, but I tried to include the relevant examples.
I have a test doc. I parse it with libxml2. If I specifically tell
swish-e to use the XML2 parser, I get different results than if I let it
default to HTML2.
The difference seems to be that the XML2 version splits words on tags,
while the HTML2 parser does not. The result? In the example below, if a
user searches for:
-h[option]
and the files have been indexed with XML2, they won't get a hit.
But if the files have been indexed with HTML2, they do.
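To make the difference concrete, here is a minimal sketch of the two behaviours. It uses Python's standard html.parser rather than libxml2, and the `words` helper and its `tag_breaks_word` flag are names made up for illustration; it only mimics the effect seen in the output below and says nothing about how swish-e or libxml2 are implemented internally.

```python
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collect the character data between tags, one chunk per text node."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def words(markup, tag_breaks_word):
    """Split markup into whitespace-delimited words.

    tag_breaks_word=False joins adjacent text nodes directly (the HTML2-like
    result); tag_breaks_word=True treats every tag as a word boundary (the
    XML2-like result). Both labels are assumptions based on the observed output.
    """
    parser = TextChunks()
    parser.feed(markup)
    parser.close()
    separator = " " if tag_breaks_word else ""
    return separator.join(parser.chunks).split()


SAMPLE = '-h<tt class="literal">[access]</tt> paramto_the_option'

print(words(SAMPLE, tag_breaks_word=False))
print(words(SAMPLE, tag_breaks_word=True))
```

With the first call the text on either side of the tag stays one word, so a search for -h[access] can match; with the second call it is split into -h and [access].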
I guess my question is: should the HTML and XML versions really act so
differently? I know the obvious answer is "use HTML2 for HTML docs, and
vice versa" but my concern is that this spacing issue may throw off my
indexing of XML docs, since (as in this example) the same search has two
different results, depending on the source format.
I googled for this and found
http://aspn.activestate.com/ASPN/Mail/Message/perl-xml/1777108
which leads me to believe that parsers work the same way.
I also found this gem:
http://mail.gnome.org/archives/xml/2001-September/msg00118.html
which leads me to believe that Bill has dealt with this already and has
something authoritative to say. ;)
I looked at parser.c and it looks like there are two different functions
called, one each for HTML and XML (htmlCreatePushParserCtxt and
xmlCreatePushParserCtxt) -- does this mean the issue is with libxml2 and
I should just suck it up and use some kind of preprocessor to strip out
the inline tags? I am using libxml2 2.6.4.
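For what it's worth, the kind of preprocessor I have in mind could be sketched roughly as follows. This is only an assumption about what such a filter might look like: the list of inline tag names and the `strip_inline_tags` helper are hypothetical, and a regular expression like this will not cope with a ">" inside an attribute value.

```python
import re

# Assumed set of inline elements to remove; adjust for the real documents.
INLINE_TAGS = ("tt", "span")

# Matches an opening or closing tag for any of the names above,
# with optional attributes, regardless of case.
_INLINE_RE = re.compile(
    r"</?(?:%s)\b[^>]*>" % "|".join(INLINE_TAGS),
    re.IGNORECASE,
)


def strip_inline_tags(markup):
    """Remove inline tags but keep the text they enclose."""
    return _INLINE_RE.sub("", markup)


line = '<tt CLASS="literal">-h [<span CLASS="optional">yes</span>]aggress</tt>'
print(strip_inline_tags(line))
```

The obvious cost is that anything the indexer could have learned from those tags (class attributes, for instance) is gone before the parser ever sees the document.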
====================================
karpet@cartermac 212% cat config
WordCharacters 0123456789abcdefghijklmnopqrstuvwxyz._-/#()+{}[]%!&=$;:'<>?\|@^
BeginCharacters 0123456789abcdefghijklmnopqrstuvwxyz._-/#()+{}[]%!&=$;:'<>?\|@^
EndCharacters 0123456789abcdefghijklmnopqrstuvwxyz_-/#()+{}[]%!&=$;:'<>?\|@^
MinWordLimit 1
#IndexContents XML* .xml .html
karpet@cartermac 213% cat test.html
<html>
<a href="some/link.html">testing 123</a>
-h<tt class="literal">[access]</tt> paramto_the_option
<tt CLASS="literal">-h [<span CLASS="optional">yes</span>]aggress</tt>
<tt CLASS="literal">-h <span CLASS="optional">[no]</span>aggress</tt>
</html>
karpet@cartermac 214% swish-e -i test.html -T PARSED_WORDS -c config -v 3
Parsing config file 'config'
Indexing Data Source: "File-System"
Indexing "test.html"
Checking file "test.html"...
test.html - Using DEFAULT (HTML2) parser - White-space found word 'testing'
White-space found word '123'
White-space found word '-h[access]'
White-space found word 'paramto_the_option'
White-space found word '-h'
White-space found word '[yes]aggress'
White-space found word '-h'
White-space found word '[no]aggress'
(8 words)
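(Same command, this time with the IndexContents line in config uncommented so that the XML2 parser is used:)
karpet@cartermac 216% swish-e -i test.html -T PARSED_WORDS -c config -v 3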
Parsing config file 'config'
Indexing Data Source: "File-System"
Indexing "test.html"
Checking file "test.html"...
test.html - Using XML2 parser - White-space found word 'testing'
White-space found word '123'
White-space found word '-h'
White-space found word '[access]'
White-space found word 'paramto_the_option'
White-space found word '-h'
White-space found word '['
White-space found word 'yes'
White-space found word ']aggress'
White-space found word '-h'
White-space found word '[no]'
White-space found word 'aggress'
(12 words)
========================================================
Moreover, if I use non-HTML tags in my test doc and the HTML2 parser is
used, I get yet another set of results. libxml2 does indeed seem to parse
HTML against the HTML DTD:
karpet@cartermac 239% xmllint test.html
<?xml version="1.0"?>
<html>
<a href="some/link.html">testing 123</a>
-h<tt class="literal">[access]</tt> paramto_the_option
<tt CLASS="literal">-h [<span CLASS="optional">yes</span>]aggress</tt>
<notag CLASS="literal">-h <foo CLASS="optional">[no]</foo>aggress</notag>
</html>
karpet@cartermac 240% xmllint --html test.html
test.html:6: HTML parser error : Tag notag invalid
<notag CLASS="literal">-h <foo CLASS="optional">[no]</foo>aggress</notag>
^
test.html:6: HTML parser error : Tag foo invalid
<notag CLASS="literal">-h <foo CLASS="optional">[no]</foo>aggress</notag>
^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<a href="some/link.html">testing 123</a><p>
-h<tt class="literal">[access]</tt> paramto_the_option
<tt class="literal">-h [<span class="optional">yes</span>]aggress</tt>
<notag class="literal">-h <foo class="optional">[no]</foo>aggress</notag></p>
</body></html>
===================
And here's what swish-e gives me (note that swish-e sees one more word,
9 instead of 8, when non-HTML tags are used):
karpet@cartermac % cat test.html
<html>
<a href="some/link.html">testing 123</a>
-h<tt class="literal">[access]</tt> paramto_the_option
<tt CLASS="literal">-h [<span CLASS="optional">yes</span>]aggress</tt>
<bar CLASS="literal">-h <foo CLASS="optional">[no]</foo>aggress</bar>
</html>
karpet@cartermac % swish-e -i test.html -T PARSED_WORDS -c config -v 3
Checking file "test.html"...
test.html - Using DEFAULT (HTML2) parser - White-space found word 'testing'
White-space found word '123'
White-space found word '-h[access]'
White-space found word 'paramto_the_option'
White-space found word '-h'
White-space found word '[yes]aggress'
White-space found word '-h'
White-space found word '[no]'
White-space found word 'aggress'
(9 words)
--
Peter Karman - Software Publications Engineer - Cray Inc
phone: 651-605-9009 - mailto:karman@cray.com
Received on Mon Feb 2 22:03:59 2004