I reproduced your example and got it working with my setup and the t.html below (the error went away once I made the case of the div tags consistent).
When I tried it with my actual html pages, I ran into some further errors. I've traced them (at least the first ones) to the meta tag not being written as well-formed xml:
<meta name="keywords" content="John and Jane">
If I add a closing </meta> (or end the tag with />), then it works, which indicates that the parser doesn't accept the regular html form.
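The difference can be illustrated with a short Python sketch. It uses Python's standard-library parsers, not the libxml2-based parser inside swish-e, so it only demonstrates the general strict-XML versus HTML behaviour. The helper names (is_well_formed_xml, MetaCollector, html_metas) are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The same meta tag, written the regular HTML way and the XML way.
UNCLOSED = '<head><meta name="keywords" content="John and Jane"></head>'
CLOSED = '<head><meta name="keywords" content="John and Jane" /></head>'


def is_well_formed_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class MetaCollector(HTMLParser):
    """Collect the attributes of every <meta> tag using HTML rules."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # Also called for <meta ... />, via the default handle_startendtag.
        if tag == "meta":
            self.metas.append(dict(attrs))


def html_metas(markup):
    """Return a list of attribute dicts, one per <meta> tag found."""
    collector = MetaCollector()
    collector.feed(markup)
    collector.close()
    return collector.metas
```

Here is_well_formed_xml(UNCLOSED) is False while is_well_formed_xml(CLOSED) is True, and html_metas() finds the keywords in both forms.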
Am I stuck here with having to convert all pages over to strict xhtml in order to be able to use the XML2 parser and grab the class attribute, or is an external program the only way? Since I'm indexing a few million pages averaging 50K each, I'd like to avoid the extra overhead of running them all through an external program each time to reformat them for the indexer.
Of course, the more I mess with trying to make one of the pages strict xhtml so that it will process, the better writing an external program sounds...
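For reference, a minimal sketch of what such an external filter might look like, in Python with only the standard library. The names DivClassExtractor and extract_div_text are hypothetical, and the sketch is not tied to any swish-e interface; it only shows pulling the text out of every div that carries a given class from ordinary (non-xhtml) html.

```python
from html.parser import HTMLParser


class DivClassExtractor(HTMLParser):
    """Collect the text found inside <div> elements that have a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # div nesting depth inside a matching div; 0 = outside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <DIV> and <div> match.
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # nested div inside a matching one
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.wanted_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_div_text(markup, wanted_class):
    """Return the whitespace-normalized text of all matching divs."""
    parser = DivClassExtractor(wanted_class)
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```

Because the parser follows html rules, an unclosed meta tag or mixed-case div tags do not stop it. The cost is the per-page overhead mentioned above, which would need measuring on the real collection.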
From: Bill Moseley [mailto:email@example.com]
Sent: Fri 3/12/2004 4:18 PM
To: Multiple recipients of list
Subject: [SWISH-E] Re: Searching only a specific div class
Yes, that's a feature of the XML parser. They are the same parser,
really, but there's just a check to see if parsing HTML and if so skip
the part that deals with XML attributes. Might be able to modify
parser.c to make it work with HTML, too -- there's just a lot of
attributes in normal html.
I think libxml2 is more forgiving when parsing HTML, for one thing. But
I'm not really clear on the differences in the parsers internal to
Now the other problem is the UndefinedMetaTags ignore is a bit too
aggressive. It ignores everything until the closing tag -- even if you
have a tag defined in between. That behavior is questionable.
My suggestion is to use a program to extract the data you want
Anyway, here's your example:
Received on Sat Mar 13 07:32:36 2004