Re: I am trying to index only <div id="content">

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Mar 08 2004 - 21:52:54 GMT
On Mon, Mar 08, 2004 at 04:42:43PM -0500, Matthew Slocum wrote:
> >I'd use -S prog and use either HTML::Parser or HTML::TreeBuilder to
> >extract out that content.
> I use -S when I run the spider.  Where do I use the regular expression or HTML::Parser?  

You use -S http to run the old spider (swishspider).

If you use -S prog you can run any program to feed docs to swish-e.
If you search the swish-e docs online, note that you get hits to specific
sections.  When indexing, I break the source page into chunks, and then
for each chunk I wrap it in <html> and </html> and add a title and
such.  Look in the distribution at html/split.pl and you'll see how that's
done.
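
To make the chunk-wrapping idea concrete, here is a rough sketch of a
-S prog feeder.  It is not what html/split.pl actually does, only an
illustration under some assumptions: the chunk list, the paths and the
helper names wrap_chunk and format_record are made up, and only the
Path-Name and Content-Length headers are written (swish-e accepts a few
optional headers as well).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Wrap one chunk of a page in a minimal HTML document with its own title.
# (Hypothetical helper; the title is assumed to be HTML-safe already.)
sub wrap_chunk {
    my ($title, $body) = @_;
    return "<html><head><title>$title</title></head>\n"
         . "<body>\n$body\n</body></html>\n";
}

# Format one document for swish-e's -S prog input: header lines,
# a blank line, then exactly Content-Length bytes of content.
# Assumes $doc is a byte string, so length() counts bytes.
sub format_record {
    my ($path, $doc) = @_;
    return "Path-Name: $path\n"
         . "Content-Length: " . length($doc) . "\n"
         . "\n"
         . $doc;
}

# Made-up chunks; a real program would split these out of the source page.
my @chunks = (
    { path => 'page.html#intro', title => 'Intro', body => '<p>First section.</p>' },
    { path => 'page.html#usage', title => 'Usage', body => '<p>Second section.</p>' },
);

print format_record($_->{path}, wrap_chunk($_->{title}, $_->{body})) for @chunks;
```

If that were saved as feed.pl (a hypothetical name), it would be run with
something like "swish-e -S prog -i ./feed.pl".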

Or you can look at prog-bin/index_hypermail.pl to see HTML::TreeBuilder
in use.  Wait -- you may have to browse the cvs repository (off the home
page) to see that usage.  index_hypermail.pl doesn't really use
HTML::TreeBuilder because it's so slow, but it should give you some
ideas.  Basically, HTML::TreeBuilder creates the HTML tree in memory and
then you ask it to return the branch you are interested in (in your
case, the div where the id attribute is "content").
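
A minimal sketch of that lookup, assuming HTML::TreeBuilder is installed
from CPAN.  The sample page and the helper name extract_content_div are
invented for illustration; this is not code from index_hypermail.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Return the HTML of the first <div id="content"> in a page,
# or undef if the page has no such div.
sub extract_content_div {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    # In scalar context look_down returns the first matching element.
    my $div  = $tree->look_down(_tag => 'div', id => 'content');
    my $out  = $div ? $div->as_HTML : undef;
    $tree->delete;    # free the in-memory tree explicitly
    return $out;
}

# Made-up sample page.
my $page = <<'HTML';
<html><head><title>Example</title></head><body>
<div id="nav">menu</div>
<div id="content"><p>Index only this.</p></div>
</body></html>
HTML

my $fragment = extract_content_div($page);
print defined $fragment ? $fragment : "no content div found\n";
```

The returned fragment could then be wrapped in <html> and </html> with a
title and handed to swish-e from a -S prog program.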

Hope that helps.

-- 
Bill Moseley
moseley@hank.org
Received on Mon Mar 8 13:52:55 2004