If you have access to the HTML documents themselves, you could run a search-and-replace over all of them to turn the comments into XML-ish tags. From the command line, these Perl one-liners will do the trick:
perl -pi.bak -e 's~<!-- Content starts here -->~<content>~gi' *.html
perl -pi.bak -e 's~<!-- end of content -->~</content>~gi' *.html
Or, to recurse into subdirectories too, combine with 'find' (shown here for the opening comment only; run it again with the second substitution for the closing comment):
perl -pi.bak -e 's~<!-- Content starts here -->~<content>~gi' `find . -name "*.html"`
The .bak part tells Perl to make a backup of every document it edits, e.g. foo.html.bak.
You could tweak the regular expression too if not all the comments in question are spelled identically.
Then the only trick would be getting your authors to use the xmlish tags rather than the comments :)
>>> <email@example.com> 07/10/03 04:02PM >>>
> We want to index some sites that have marked up certain sections of
> their content with HTML comments (unfortunately).
No, not easily. I was going to suggest if you are good at Perl that you
use HTML::Parser and translate comments to a new tag upon indexing, but it
would probably be easier to modify src/parser.c to add comments as a
property as well as index. But that might take a while unless someone can
figure it out and submit a patch.
Received on Fri Jul 11 17:17:10 2003