On Sat, Mar 13, 2004 at 07:31:35AM -0800, Thomas Sewell wrote:
> Am I stuck here with having to convert all pages over to strict xhtml
> in order to be able to use the XML2 parser and grab the class
> attribute, or is an external program the only way? Since I'm indexing
> a few million pages averaging 50K each, I'd like to avoid the extra
> overhead of running them all through an external program each time to
> reformat them for the indexer.
I agree. A SAX parser might not be that slow, but why run the pages through
an extra parser? Are you willing to hack on the source? You might be able
to get the behaviour you want by changing just a few lines of code in
parser.c. Take advantage of it being open source.
This single change seems to make it work when using the HTML parser:
     /* Index the content of attributes */
-    if ( !parse_data->parsing_html && attr )
+    if ( attr )
     int class_found = 0;
I didn't test it much, so it would be wise to try a few documents and use
the -T options to check what's getting indexed.
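
To spell out what that one-line change does, here is a minimal standalone
sketch of the two conditions. It is not the real parser.c code: the struct
and the function names are hypothetical, only the parsing_html field and
the attr pointer come from the diff above, and the type of attr is an
assumption (a NULL-terminated name/value array, as a SAX start-element
callback would receive).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parser state; only the field that
   appears in the diff is modelled here. */
struct parse_data_sketch {
    bool parsing_html;   /* true when the HTML parser is in use */
};

/* Original condition: attribute content is considered for indexing
   only when NOT parsing HTML and the element has attributes. */
static bool index_attr_before(const struct parse_data_sketch *pd,
                              const char **attr)
{
    return !pd->parsing_html && attr != NULL;
}

/* Patched condition: attribute content is considered whenever the
   element has attributes, regardless of which parser is in use. */
static bool index_attr_after(const struct parse_data_sketch *pd,
                             const char **attr)
{
    (void)pd;            /* parser mode no longer matters */
    return attr != NULL;
}
```

With the original test, an HTML document never reaches the attribute
handling at all, so the class attribute is never seen. With the patched
test it is reached in both modes, which is also why it is worth checking
that nothing unexpected from other HTML attributes ends up in the index.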
Received on Sat Mar 13 13:58:25 2004