Re: Science quotations formatted

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu May 20 2004 - 23:14:52 GMT
On Thu, May 20, 2004 at 08:05:05AM -0700, Carl Gaither wrote:
> Hello,
> 
> I have reformatted all of the data that I have to date.  It now looks 
> like the following:

I'm not sure how easy this will be for you.  I didn't spend a lot of
time on it, and there are things I would do differently.

One problem with Perl is you have to install modules.  How that is done
depends on your operating system.  I use Debian and it's trivial to
install new modules.  But for some, it's a pain.

I think the modules you would need are:

    Template  (Template-Toolkit)
    HTML::FillInForm
    XML::Simple

You will also need to install SWISH::API.  That's done after installing
swish-e.  Go to the "perl" directory of the distribution and look at the
README.
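
Before going further, a quick check along these lines can tell you which
of those modules still need installing.  This is only a sketch: the
missing_modules() helper is a name made up for this example, and the
module list is just the four named above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the names of the modules that cannot be loaded on this machine.
# (missing_modules is a hypothetical helper, not part of swish-e.)
sub missing_modules {
    my @names = @_;
    my @missing;
    for my $name (@names) {
        # Turn Foo::Bar into Foo/Bar.pm so require can be given a file name.
        ( my $file = "$name.pm" ) =~ s{::}{/}g;
        push @missing, $name unless eval { require $file; 1 };
    }
    return @missing;
}

my @missing = missing_modules(qw( Template HTML::FillInForm XML::Simple SWISH::API ));
print @missing
    ? "Missing: @missing\n"
    : "All required modules are installed.\n";
```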

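Perl is slow as a CGI script because the interpreter has to start up and
compile the script and its modules on every request.  If you run under
mod_perl or speedy (SpeedyCGI) it will be much faster.  Whether that
matters depends on your load requirements.  Worry about that later.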


First the indexing.  

I used the Perl module XML::Simple which just loads the XML into a Perl
data structure in memory.  I used this because, well, it's simple.  The
problem with this method is that it doesn't scale well: the whole file
has to fit in memory at once.  A better module to use would be XML::SAX,
which parses the file as a stream and so doesn't require loading all the
data into memory at the same time.

I'll include that short script below -- called parse_quotes.pl.  It
expects the input file to be called "quotes.xml".  Take a look at the
output generated from running that program (on a sample file).  Don't
forget you will likely need to wrap all your records in a single tag
like <xml>...-your records here-...</xml>.

The swish-e config file is simple:

    moseley@bumby:~/apache$ cat quotes.conf
    PropertyNames reference quote date copyright title author
    MetaNames quote author reference title

MetaNames allows for limiting the search by those fields.  PropertyNames
is used for storing the content.  The parse_quotes.pl program wraps
author, title, reference and quote in <swishdefault> tags -- which allow
for searching all four fields at once.
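
To make the wrapping concrete, here is a rough sketch of building one
such document.  It is not the parse_quotes.pl code itself: build_document()
and xml_escape() are names invented for this example, the record is made
up, and it assumes every field has already been reduced to a plain string.

```perl
use strict;
use warnings;

# Fields that should be searchable together via <swishdefault>.
my @default_fields = qw( reference quote title author );
# Fields that are only stored as properties.
my @other_fields   = qw( date copyright );

# Escape the characters that are special in XML text.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Build one XML document from a hash reference of field name => string.
# Empty or missing fields are skipped.
sub build_document {
    my ($rec) = @_;

    my $tag = sub {
        my ($name) = @_;
        my $text = $rec->{$name};
        return '' unless defined $text && length $text;
        return "<$name>" . xml_escape($text) . "</$name>\n";
    };

    return "<xml>\n<swishdefault>\n"
         . join( '', map { $tag->($_) } @default_fields )
         . "</swishdefault>\n"
         . join( '', map { $tag->($_) } @other_fields )
         . "</xml>\n";
}

# Hypothetical record, for illustration only.
print build_document({
    author => 'A. N. Author',
    title  => 'Notes & Queries',
    quote  => 'An example quotation.',
    date   => '1900',
});
```

Because the four searchable fields sit inside <swishdefault>, a search
with no metaname given matches any of them, while date and copyright are
only stored.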

Indexing goes like this:

    moseley@bumby:~/apache$ perl parse_quotes.pl | swish-e -S prog -i stdin -c quotes.conf
    Indexing Data Source: "External-Program"
    Indexing "stdin"
    Removing very common words...
    no words removed.
    Writing main index...
    Sorting words ...
    Sorting 107 words alphabetically
    Writing header ...
    Writing index entries ...
      Writing word text: Complete
      Writing word hash: Complete
      Writing word data: Complete
    107 unique words indexed.
    10 properties sorted.
    3 files indexed.  1,198 total bytes.  304 total words.
    Elapsed time: 00:00:00 CPU time: 00:00:00
    Indexing done!

So the output of the parse_quotes.pl program is fed to swish-e for
indexing.
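
Each document handed to swish-e this way is a few header lines, a blank
line, and then the document body.  A minimal sketch of that framing
follows.  frame_record() is a made-up name, and the sketch assumes the
output is written as UTF-8 and that Content-Length should therefore count
bytes rather than characters.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Frame one document for "swish-e -S prog": header lines, a blank line,
# then the body.  (frame_record is a hypothetical helper.)
sub frame_record {
    my ( $path, $xml ) = @_;
    my $bytes = encode( 'UTF-8', $xml );   # count bytes, not characters
    return "Path-Name: $path\n"
         . "Content-Length: " . length($bytes) . "\n"
         . "Document-Type: XML2\n"
         . "\n"
         . $bytes;
}

# The e-acute below is one character but two bytes in UTF-8.
print frame_record( 'Record-0', "<xml><quote>caf\x{e9}</quote></xml>\n" );
```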

Ok, now the search script.  I modified search.cgi, which uses
the Template-Toolkit and HTML::FillInForm modules.  The stock search.cgi
is a skeleton script -- it doesn't know how to limit by metanames so I had to
add that feature.  Also, since swishdefault is made up of four fields I
had to add a way to make highlighting work when searching for author=foo
or swishdefault=foo.
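
The idea behind the highlighting change can be sketched on its own.
This assumes the parsed query is a hash of metaname => list of phrases,
with each phrase a list of words; the real structure comes from
SWISH::ParseQuery and may differ in detail.  expand_metas() and the
sample data are made up for the example.

```perl
use strict;
use warnings;

# Copy the phrases found under each umbrella metaname (such as
# swishdefault) onto the real metanames it stands for.
sub expand_metas {
    my ( $parsed, $map ) = @_;
    for my $meta ( keys %$map ) {
        my $phrases = $parsed->{$meta} or next;
        push @{ $parsed->{$_} }, @$phrases for @{ $map->{$meta} };
    }
    return $parsed;
}

# Hypothetical parsed query, for illustration only.
my $parsed = expand_metas(
    { swishdefault => [ ['darwin'] ], title => [ ['origin'] ] },
    { swishdefault => [qw( reference quote title author )] },
);

# title now carries both its own phrase and the swishdefault one.
print scalar @{ $parsed->{title} }, " phrases for title\n";
```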

I also added a radio group to allow limiting by one of the four
metanames.
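
Since the metaname arrives from the browser, it may be worth checking it
against the known list before putting it into the query.  A small sketch,
where limit_query() is a made-up name and the list is the four metanames
from quotes.conf plus swishdefault:

```perl
use strict;
use warnings;

# Metanames the form is allowed to send.
my %allowed = map { $_ => 1 } qw( swishdefault title author quote reference );

# Wrap the user's query in a metaname limit, ignoring unknown metanames.
# (limit_query is a hypothetical helper.)
sub limit_query {
    my ( $query, $metaname ) = @_;
    return $query unless defined $metaname && $allowed{$metaname};
    return "$metaname=( $query )";
}

print limit_query( 'evolution', 'author' ), "\n";   # author=( evolution )
print limit_query( 'evolution', 'bogus' ),  "\n";   # evolution
```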

Here are the two files: first parse_quotes.pl, which parses your XML, and
then the modified search.cgi script (with documentation removed).

Note that the search.cgi listed below is specific to my machine -- you
may need to change some paths.

------------------------- parse_quotes.pl -------------------------

#!/usr/bin/perl -w
use strict;
use XML::Simple;
use Data::Dumper;

my $data = XMLin( 'quotes.xml' );

# Uncomment to see how the data is read
# print Dumper $data;

my $count = 0;

output_record( $_ ) for @{ $data->{record} };

sub output_record {

    my ( $rec ) = @_;

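    my $out_string = "<xml>\n<swishdefault>\n";
    for ( qw/ reference quote title author date copyright / ) {
        # Close <swishdefault> before the first property-only field, so the
        # tag is closed even when a record has no author.
        $out_string .= "</swishdefault>\n" if $_ eq 'date';
        my $text = fetch_string ( $rec->{$_} );
        next unless $text;
        # XML::Simple decodes entities, so re-escape before writing XML.
        $text =~ s/&/&amp;/g;
        $text =~ s/</&lt;/g;
        $text =~ s/>/&gt;/g;
        $out_string .= "<$_>$text</$_>\n";
    }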

    $out_string .= "</xml>\n";

    my $length = length $out_string;
    print <<EOF;
Path-Name: Record-$count
Content-Length: $length
Document-Type: XML2

EOF
    print $out_string;

    $count++
}



sub fetch_string {
    my $value = shift;

    return '' unless $value;

    # Plain string?
    return $value unless ref $value;

    # Otherwise expect a hash whose "ln" element holds the lines; join them
    # with "::" to flag the line breaks (the template turns "::" into <br>).
    # No error checking to make sure hash and array -- just let it bomb out
    # since it's an unexpected format
    #
    return '' unless $value->{ln};

    return $value->{ln} unless ref $value->{ln};

    # Flag line breaks with a ::
    return join "::", @{$value->{ln}};
}


-------------- search.cgi -----------------------------------


#!/usr/local/bin/perl -w
#!/usr/bin/speedy -w
use strict;

######################################################################
# Skeleton CGI script for searching a Swish-e index with SWISH::API.
# see below for documentation or run "perldoc search.cgi"
#
# Copyright 2003 Bill Moseley - All rights reserved.
#######################################################################


# This needs to be set to where Swish-e's "make install" installed the
# helper Perl modules.
use lib qw( /usr/local/lib/swish-e/perl );

#------------------- Modules --------------------------------------
use SWISH::API;             # for searching the index file
use SWISH::ParseQuery;      # Parses the query string
use SWISH::PhraseHighlight; # for highlighting
use CGI;                    # provides a param() method -- could use Apache::Request, for example.
use HTML::FillInForm;       # makes the form elements sticky
use Template;               # Template-Toolkit: http://tt2.org or see http://search.cpan.org



#-------------------- Defaults/Parameters --------------------------
# Directory where templates are stored (if using an external template)
use constant INCLUDE_PATH => "/home/moseley/apache";


# Cached variables
my ($template, $swish, %headers, $highlight_object, $fill_in_object );  


# Params used for the highlighting modules
my %highlight_settings = (
    show_words      => 8,   # number of words to show around each highlighted word
    occurrences     => 5,   # number of highlighted occurrences to show
    max_words       => 100, # number of words to show if no highlighted words are found
    highlight_on    => '<span class="highlight">',
    highlight_off   => '</span>',
);

# This maps nested meta names -- that is, meta names that represent multiple
# metanames (a search for -w all=foo might really search title=foo and
# author=foo).  This expansion is only needed for highlighting.
my %expand_metas = (
    swishdefault => [ qw/ reference quote title author / ],
);


#--------------------- Code ----------------------------------------
# Entry point for normal CGI programs.
process_request();


# Entry point for mod_perl (not tested yet)
sub handler {
    my $r = shift;
    process_request();
}



sub process_request {
    my $cgi = CGI->new;  # could also be Apache::Request or other fast access to CGI params
    my $query = $cgi->param('query');



    # This data is made available to the template.
    my %params = (
        title   => 'Company Name',  # what-ever data
        index   => 'index.swish-e', # index to search
        myself  => 'search.cgi',    # for use in generating links back to script
        pid   => $$,
    );


    # Create template object if not cached
    $template ||= Template->new( INCLUDE_PATH => INCLUDE_PATH );
    die $template->error unless $template;


    # If a query was passed in then run the search
    if ( $query ) {

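        #  Limit by metaname (only accept the metanames offered in the form)
        my $metaname = $cgi->param('metaname');
        if ( $metaname && $metaname =~ /^(?:swishdefault|title|author|quote|reference)$/ ) {
            $query = "$metaname=( $query )";
        }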


        my $start = $cgi->param('page') || 1;
        my $pagesize = 15;
        $params{search} = run_query( $query, $start, $pagesize, $params{index} );
    }



    # Generate the output from the template

    print $cgi->header;
    my $template_output;

    $template->process( \*DATA, \%params, \$template_output) || die $template->error;
    #$template->process( 'foo.tt', \%params, \$template_output ) || die $template->error;



    # Run output through HTML::FillInForm to make form elements sticky

    $fill_in_object ||= HTML::FillInForm->new;
    print $fill_in_object->fill( scalarref => \$template_output, fobject => $cgi );

}





# Subroutine to run the Swish query.  Returns a hash reference.
# A better design might be to return an object with methods for accessing the data.

sub run_query {
    my ($query, $page, $pagesize, $index) = @_;

    $page = 1 unless defined $page  && $page =~ /^\d+$/;
    $pagesize = 15 unless defined $pagesize && $pagesize =~ /^\d+$/ && $pagesize > 0 && $pagesize < 50; 


    # Create the swish object if not cached.
    # Also read in the header data and initialize the highlighting module

    if ( ! $swish ) {
        $swish = SWISH::API->new( $index );
        die "Failed to create SWISH::API object" unless $swish;
        $swish->AbortLastError if $swish->Error;


        # Now cache header data (used for highlighting)
        %headers = map { lc($_) => ($swish->HeaderValue( $index, $_ )||'') } $swish->HeaderNames;

        # and cache the highlighting object
        $highlight_object = SWISH::PhraseHighlight->new( \%highlight_settings, \%headers );
    }


    # Run the search.  See SWISH::API for more options (like sorting)

    my $results = $swish->Query( $query );

    if ( $swish->Error ) {
        $swish->AbortLastError if $swish->CriticalError;
        return {
            query   => $query,
            message => join( ' ', $swish->ErrorString, $swish->LastErrorMsg ),
        };
    }


    # Seek to the first record of the page requested

    $results->SeekResult( ($page-1) * $pagesize );

    my @records;
    my $result;
    my $cnt = $pagesize;

    # Store the result objects in an array
    push @records, $result while $cnt-- && ($result = $results->NextResult);

    # Now create a filter 'highlight' for use in the template to highlight terms
    # Usage requires passing in the *metaname* associated with the property
    # that's being highlighted -- this allows the program to know what
    # search words to use in highlighting 

    my $parsed_query = parse_query( join ' ', $results->ParsedWords( $index ) );

    # Expand nested queries
    # A search for swishdefault or all might really include multiple tags
    # So expand the keywords to the extra fields
    #
    for my $meta ( keys %expand_metas ) {
        next unless $parsed_query->{$meta};

        # Now duplicate
        # Push to merge any existing, for example if there was a query like:
        # -w all=foo AND title=bar  then title would end up with both
        # Not perfect

        for my $real_meta ( @{$expand_metas{$meta}} ) {
            push @{ $parsed_query->{$real_meta} }, @{$parsed_query->{$meta}};
        }
    }

    $template->context->define_filter( 'highlight', sub {
        my ( $context,  $metaname ) = @_;
        my $phrases = $parsed_query->{$metaname};


        return sub {
            my $text = shift;
            $highlight_object->highlight( \$text, $phrases);
            return $text;
        }
    }, 1 );



    # Return the results structure

    my %query = (
        # parse out the query words
        query   => $query,
        results => \@records,
        hits    => $results->Hits,
        shown   => scalar @records,
        page    => $page,
        start   => ($page-1) * $pagesize,
    );


    $query{prev} = $page-1 if $page > 1;
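    # start is a zero-based offset, so there is a next page whenever the end
    # of this page falls short of the total number of hits.
    $query{next} = $page+1 if $query{start} + $pagesize < $query{hits};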

    return \%query;
}



__DATA__


<html>
<head>
    <title>Search Documents</title>

    <style type="text/css">
        a:hover { background: #CCC; }
        body { 
            font-family : verdana,arial,helvetica,sans-serif;
            margin-left: 10em;
            margin-right: 10em;
        }
        .header { background-color: #EEEEEE; padding-left: 5px; }
        .title { font-size: 1.2em; margin-top: 1em; }
        .rank  { color: red; font-size: 0.8em; }
        .description { 
            margin-top: 1em; margin-bottom: 1em; margin-left: 2em; 
            max-width: 700px; /* not supported by IE */
            width:expression(document.body.clientWidth > 600? "600px": "auto" );
        }
        .metadata { margin-left: 2em; font-size: 0.8em; color: green; }
        .metadata a { text-decoration: none; color: green; }
        .highlight { background : #FFFF99; font-weight: bold; }
        .reference { text-align: right; }

    </style>
</head>
<body>

    [% IF search AND NOT search.hits AND NOT search.message; "<b>No Results found</b><p>"; END %]
    [% PROCESS form %]
    <p>
    [% IF search %]
        [% IF search.message; '<h2 align="center">'; search.message; "</h2>"; END %]

        [% IF search.shown %]
            [% PROCESS results_header %]
            [% PROCESS display_results %]
        [% END %]
    [% END %]
</body>
</html>



[% BLOCK form %]
    <form method="get" action="[% myself %]" enctype="application/x-www-form-urlencoded">
        <input type="text" name="query" value="" size="40" maxlength="200" />
        <input type="submit" name="submit" value="Search!" /><br>
        Limit search to:
        <input type="radio" name="metaname" value="swishdefault" />All
        <input type="radio" name="metaname" value="title" />Title
        <input type="radio" name="metaname" value="author" />Author
        <input type="radio" name="metaname" value="quote" />Quotation
        <input type="radio" name="metaname" value="reference" />Reference
    </form>
[% END %]


[% BLOCK results_header %]
<div class="header">
    Showing page [% search.page %] 
    ([% search.start +1 %] - [% (search.start + search.shown) %] of [% search.hits %] hits) 
    for <b>[% search.query | html %]</b><br>

    [% USE myurl = url( myself, query=search.query ) %]

    [% IF search.prev %]
       <a href="[% myurl( page=search.prev ) %]">Previous</a> 
    [% END %]
    [% IF search.next %]
       <a href="[% myurl( page=search.next ) %]">Next</a> 
    [% END %]
</div>
[% END %]

[% BLOCK display_results %]
    [% USE date %]
    [% FOREACH item = search.results %]
        <div class="title">
            [% item.Property('author') | highlight('author') %]
            [% item.Property('date') | html %]
            [% item.Property('title') | highlight('title') %]
        </div>

        <div class="description">
            [% item.Property('quote') | highlight('quote') | replace('::', '<br>') %]
        </div>

        <div class="reference">
            <cite>
            [% item.Property('reference') | highlight('reference') | replace('::', '<br>') %]
            </cite>
        </div>
    [% END %]
[% END %]
[% STOP %]

__END__


-- 
Bill Moseley
moseley@hank.org
Received on Thu May 20 16:14:53 2004