

Searching sub-sections of documents (or Why use Perl)

From: Bill Moseley <moseley(at)>
Date: Thu Jan 31 2002 - 00:21:51 GMT
Here's one good reason to use perl.

I'm indexing a site that's mostly documentation.  The web pages are generated from .pod files by a program very similar to the one used to create the swish-e 2.2 HTML docs.

The resulting pages can be quite long, so searching them is not very useful -- you search for something and swish just tells you it's somewhere on a 100K web page.

If you have ever used the little search feature for searching the 2.2 docs on the site, you know that the results take you to "split" pages -- small HTML pages, one for each section of the docs.  That's good, since a search takes you directly to the section you are looking for.  But it's also a pain, since you can't see the context of the whole document (you only see one little section).

So, what would be nice is if swish could somehow index web pages in sections.  That is, instead of a search returning a hit for the whole page, it would return a hit for the page's URL with a fragment that jumps straight to the matching section.

Anyway, it would be (somewhat?) hard to make swish itself do that.  But with Perl it was trivial to extend the spider to do it.

The documentation in question is generated from templates, so it was easy to change how the HTML is generated: I modified the template to wrap each section in <div class="index_section">.  That gives me something to detect the sections while spidering.

 Each section also starts with a

      <a name="This_is_some_section">

So not only could I make swish return a path that would jump right to that section, I could also display that in the title.  In other words, if the title of the page was "Page Foo", swish will now return in results:

     "Page Foo: This is some section"

Kind of cool, no?
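A minimal sketch of the two pieces involved -- building the fragment URL with the CPAN URI module, and turning the anchor name back into readable text (the URL here is a made-up example, not from the site in question):

```perl
use URI;

# Build a result link that jumps straight to the section
# (example.com is a placeholder URL).
my $uri = URI->new('http://example.com/docs/page.html');
$uri->fragment('This_is_some_section');
print "$uri\n";   # http://example.com/docs/page.html#This_is_some_section

# Turn the anchor name back into a readable title suffix
my $text_title = 'This_is_some_section';
$text_title =~ tr/_/ /s;
print "Page Foo: $text_title\n";   # Page Foo: This is some section
```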

With Perl, I just use HTML::TreeBuilder (built on the HTML::Parser modules) from CPAN, and all the parsing work is done for me.
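For instance, here is a toy sketch of how HTML::TreeBuilder's look_down() picks out the marked sections (the inline HTML is a made-up example):

```perl
use HTML::TreeBuilder;

my $html = <<'HTML';
<html><head><title>Page Foo</title></head>
<body>
  <div class="index_section"><a name="First_section"></a><p>one</p></div>
  <div class="index_section"><a name="Second_section"></a><p>two</p></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# look_down() matches elements by tag name and attribute values
my @sections = $tree->look_down( '_tag', 'div', 'class', 'index_section' );
print scalar(@sections), " sections\n";   # 2 sections

$tree->delete;   # HTML::TreeBuilder trees must be freed explicitly
```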

So I can extend what the spider can do entirely within its configuration file.  I added this to the config:

      filter_content => \&split_page,

Which tells the spider to call the split_page() subroutine for every document it fetches.  (This is how PDF and MS Word docs are converted, too.)
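For context, a filter_content callback plugs into the spider's configuration file roughly like this -- a sketch only; the base_url and email values are placeholders, not from the original post:

```perl
# SwishSpiderConfig.pl -- sketch; server details are placeholders
@servers = (
    {
        base_url       => 'http://example.com/docs/',
        email          => 'admin@example.com',
        filter_content => \&split_page,
    },
);

1;
```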

And here's all the code needed to do this:

use HTML::TreeBuilder;
use HTML::Element;

sub split_page {

    my %params;
    @params{ qw/ uri server response content / } = @_;
    $params{found} = 0;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse( ${$params{content}} );
    $tree->eof;

    # grab the <head> section
    my $head = $tree->look_down( '_tag', 'head' );

    # now grab each <div class="index_section"> section
    # and index it
    for my $section ( $tree->look_down( '_tag', 'div', 'class', 'index_section' ) ) {
        create_page( $head->clone, $section->clone, \%params );
    }

    $tree->delete;  # clean up

    # if any sections were indexed, tell the spider not to index the whole page
    return !$params{found};
}

# This builds a new HTML page from the <head> and <div> sections
# and sends it to swish for indexing

sub create_page {
    my ( $head, $section, $params ) = @_;

    my $uri = $params->{uri};

    # Grab the title of this section from <a name="foo">
    my $section_name = 'Unknown_Section';
    my $name = $section->look_down( '_tag', 'a', sub { defined $_[0]->attr('name') } );

    if ( $name ) {
        $section_name = $name->attr('name');
        $uri->fragment( $section_name );
    }

    my $text_title = $section_name;
    $text_title =~ tr/_/ /s;  # underscores back to spaces

    my $title = $head->look_down( '_tag', 'title' );

    if ( $title ) {
        $title->push_content( ": $text_title" );
    } else {
        $title = HTML::Element->new('title');
        $title->push_content( $text_title );
        $head->push_content( $title );
    }

    # now, build a new HTML page
    my $body = HTML::Element->new('body'); # <body>
    my $doc  = HTML::Element->new('html'); # <html>

    $body->push_content( $section );
    $doc->push_content( $head, $body );

    my $new_content = $doc->as_HTML( undef, "\t" );
    output_content( $params->{server}, \$new_content, $uri, $params->{response} );

    $params->{found}++; # record that a section was found and indexed
}
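To run the whole thing, the usual swish-e 2.2 arrangement is to point swish at the spider via -S prog -- a sketch, assuming a swish.conf along these lines (file names are the conventional defaults, not taken from the post):

```shell
# swish.conf (sketch):
#   IndexDir            spider.pl
#   SwishProgParameters SwishSpiderConfig.pl
swish-e -c swish.conf -S prog
```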

Bill Moseley
Received on Thu Jan 31 00:22:18 2002