By the way, if you do decide to parse and filter the HTML, here's some Perl code you
can plug into the example spider.pl's configuration (SwishSpiderConfig.pl)
to strip out the <select> tags and their contents. This is only available
in the development version of swish, of course.
(http://sunsite.berkeley.edu:4444/swish-daily/)
If you are not spidering a web server, then you can use $tree->parse_file()
instead and use the DirTree.pl example in the prog-bin directory for ideas.
I didn't run any long tests, but the filter did seem to add a bit of time to
the indexing (50% more?). Parsing HTML isn't fast. Maybe a regular expression
would be faster? Of course, if you know which files have select tags then
you can just parse those.
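
As a rough sketch of the regular-expression idea (not benchmarked, so the speed question stays open): the helper below first checks cheaply whether a page contains a <select> tag at all, then removes each <select>...</select> block with a non-greedy match. The names strip_select and no_select_regex are made up for illustration, and the filter takes the same arguments as the no_select filter shown further down. A regex like this will get nested or unclosed <select> tags wrong, and it will also match inside comments or scripts, so treat it as a shortcut rather than a real parser.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every <select>...</select> block from the
# HTML string referenced by $content_ref, editing it in place.
# Returns the number of blocks removed.
sub strip_select {
    my ($content_ref) = @_;

    # Cheap check first: skip documents with no <select> tag at all.
    return 0 unless $$content_ref =~ /<select\b/i;

    # /s lets . match newlines, /i ignores case, /g removes every block.
    my $removed = ( $$content_ref =~ s{<select\b[^>]*>.*?</select\s*>}{}gis );
    return $removed || 0;
}

# Hypothetical filter taking the same arguments as no_select.
sub no_select_regex {
    my ( $uri, $server, $response, $content_ref ) = @_;

    # Only deal with html pages
    return 1 unless $response->content_type eq 'text/html';

    strip_select($content_ref);
    return 1;
}
```

If it behaves well on your pages, \&no_select_regex could take the place of \&no_select in the filter list.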
Anyway, add this someplace in SwishSpiderConfig.pl (found in prog-bin):

    use HTML::TreeBuilder;

    sub no_select {
        my ( $uri, $server, $response, $content_ref ) = @_;

        # Only deal with html pages
        return 1 unless $response->content_type eq 'text/html';

        my $tree = HTML::TreeBuilder->new;
        $tree->store_comments(1);  # index comments?
        $tree->parse( $$content_ref );
        $tree->eof;

        # Remove every <select> element along with its contents
        $_->delete for $tree->find_by_tag_name('select');

        # Replace the document with the filtered HTML
        $$content_ref = $tree->as_HTML;
        $tree->delete;  # free the tree's memory
        return 1;
    }

Then in SwishSpiderConfig.pl modify the parameters in the hash to do
something like this:

    filter_content => [ \&pdf, \&doc, \&no_select ],

This calls the three filters for each document.
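
For the non-spider case mentioned above, here is a minimal sketch of the same idea using parse_file() on a local file. The function name strip_select_from_file is hypothetical and not part of swish. How the result gets handed to swish depends on your program, so look at DirTree.pl for that part.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not part of swish): read a local HTML file and
# return its markup with every <select> element removed.
sub strip_select_from_file {
    my ($path) = @_;
    die "Cannot read $path\n" unless -r $path;

    my $tree = HTML::TreeBuilder->new;
    $tree->store_comments(1);    # keep comments, as in no_select
    $tree->parse_file($path);    # parse_file calls eof for you

    # Remove every <select> element along with its contents
    $_->delete for $tree->find_by_tag_name('select');

    my $html = $tree->as_HTML;
    $tree->delete;               # free the tree's memory
    return $html;
}
```

Calling strip_select_from_file('/path/to/page.html') returns the filtered HTML as a string and leaves the file on disk untouched.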
Disclaimer: I didn't test very much...
Bill Moseley
mailto:moseley@hank.org
Received on Thu Apr 12 04:39:47 2001