
Re: ignoring words inside form elements

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Apr 12 2001 - 04:37:55 GMT
By the way,

If you do decide to parse and filter the HTML, here's some Perl code you
can plug into the example spider.pl's configuration file (SwishSpiderConfig.pl)
to strip out the <select> tags and their contents.  This is only available
in the development version of swish, of course.
(http://sunsite.berkeley.edu:4444/swish-daily/)

If you are not spidering a web server, then you can use $tree->parse_file()
instead and use the DirTree.pl example in the prog-bin directory for ideas.
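
For local files, the same filtering might look something like this rough
sketch (the sub name is my own invention, and I haven't tested it beyond
the basics):

```perl
use HTML::TreeBuilder;

# Read an HTML file from disk, remove all <select> elements, and
# return the filtered HTML as a string.
sub strip_select_from_file {
    my ( $path ) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file( $path );   # reads and parses the file directly

    # Delete every <select> element and everything inside it
    $_->delete for $tree->find_by_tag_name('select');

    my $html = $tree->as_HTML;
    $tree->delete;                # free the tree's memory
    return $html;
}
```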

I didn't run any long tests, but it did seem to add a bit of time to the
indexing (maybe 50% more?).  Parsing HTML isn't fast.  Maybe a regular
expression would be faster?  Of course, if you know which files have select
tags then you can just parse those.
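
A crude regex version of the filter might look like this (untested guess
on my part -- it skips the parse tree entirely, so it will misbehave on
malformed or commented-out markup, but it should be much faster):

```perl
# Drop-in replacement for no_select that uses a substitution
# instead of HTML::TreeBuilder.
sub no_select_regex {
    my ( $uri, $server, $response, $content_ref ) = @_;

    # Only deal with HTML pages
    return 1 unless $response->content_type eq 'text/html';

    # Remove <select ...> ... </select>, case-insensitively,
    # matching across newlines, non-greedily per tag pair.
    $$content_ref =~ s{<select\b.*?</select\s*>}{}gis;
    return 1;
}
```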

Anyway, add this someplace in SwishSpiderConfig.pl (found in prog-bin):

use HTML::TreeBuilder;

sub no_select {
    my ( $uri, $server, $response, $content_ref ) = @_;

    # Only deal with HTML pages
    return 1 unless $response->content_type eq 'text/html';

    my $tree = HTML::TreeBuilder->new;
    $tree->store_comments(1);     # keep comments so they get indexed
    $tree->parse( $$content_ref );
    $tree->eof;

    # Delete every <select> element and everything inside it
    $_->delete for $tree->find_by_tag_name('select');

    # Replace the document content with the filtered HTML
    $$content_ref = $tree->as_HTML;
    $tree->delete;                # free the tree's memory
    return 1;
}

Then in SwishSpiderConfig.pl modify the parameters in the hash to do
something like this:

   filter_content  => [ \&pdf, \&doc, \&no_select ],

That calls all three filters on each document.



Disclaimer: I didn't test very much...



Bill Moseley
mailto:moseley@hank.org
Received on Thu Apr 12 04:39:47 2001