>I'd use -S prog and use either HTML::Parser or HTML::TreeBuilder to
>extract out that content.
I use -S when I run the spider. Where do I use the regular expression or HTML::Parser?
Matt Slocum
>>> Bill Moseley <moseley@hank.org> 03/08/04 04:35PM >>>
On Mon, Mar 08, 2004 at 01:15:44PM -0800, Matthew Slocum wrote:
> I am trying to index only <div id="content">
> I think it is giving me all the div tags.
>
> in swish.conf:
> StoreDescription HTML "<div id=\"content\">"
No, that won't work, sorry.
I'd use -S prog and use either HTML::Parser or HTML::TreeBuilder to
extract out that content.
You might be able to use a regular expression to extract the content,
although using regular expressions to parse HTML can be hard. But that
would be much faster than HTML::Parser or HTML::TreeBuilder.
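
With -S prog, the extraction happens inside the external program, before
it hands each document to swish-e. Below is a rough sketch of that one
step only. It is written in Python with the standard-library html.parser
as a stand-in for Perl's HTML::Parser, and the names ContentDivExtractor
and extract_content are made up for this illustration, not part of
swish-e or the spider. It tracks nesting depth so that inner <div> tags
do not end the capture early, which is where a plain regular expression
usually goes wrong.

```python
from html.parser import HTMLParser


class ContentDivExtractor(HTMLParser):
    """Collect the text inside <div id="content">, nested divs included.

    Hypothetical helper for illustration; not part of swish-e.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # 0 means we are outside the target div
        self.chunks = []  # text fragments found inside the target div

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Already inside the target: count nested divs.
            if tag == "div":
                self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == "content":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_content(html):
    """Return the whitespace-normalized text of <div id="content">."""
    parser = ContentDivExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    page = (
        '<div id="nav">menu</div>'
        '<div id="content"><p>Hello <b>world</b></p>'
        '<div>inner</div> tail</div>'
        '<div>footer</div>'
    )
    print(extract_content(page))
```

The program given to -S prog would run something like extract_content on
each fetched page and output only the result for indexing. Text from the
nav and footer divs never reaches the index.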
--
Bill Moseley
moseley@hank.org
Received on Mon Mar 8 13:44:24 2004