Re: Filtering

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Jul 25 2003 - 19:52:16 GMT
On Fri, Jul 25, 2003 at 12:24:37PM -0700, Roubart Capcap wrote:
> I am planning to add xl2csv as another filter to parse MS Excel files besides the XLtoHTML.pm filter.  I copied Doc2txt.pm and made it xls2csv.pm with the following changes:
> 
> package SWISH::Filters::xls2csv;
> use vars qw/ %FilterInfo $VERSION /;
> 
> 
> $VERSION = '0.01';
> 
> %FilterInfo = (
>     type     => 2,  # normal filter
>     priority => 50, # normal priority 1-100
> );
> 
> sub filter {
>     my $filter = shift;
> 
>     # Do we care about this document?
>     return unless $filter->content_type =~ m!application/vnd.ms-excel!;
> 
>     # We need a file name to pass to the xls2csv program
>     my $file = $filter->fetch_filename;
> 
>     # Grab output from running program
>     my $content = $filter->run_program( 'xls2csv', $file );
> 
>     # update the document's content type
>     $filter->set_content_type( 'text/plain' );
> 
> How and where do I specify that xls files should be parsed by both
> filters.  

Both filters?  If you are converting to csv then you wouldn't want the 
other Excel filter to process it, would you?

Anyway, the type and priority are what set the order in which the 
filters run.  If you want other filters to process the document after 
yours, instead of stopping once your filter is done, call 
$filter->set_continue.  (All this is from looking at the docs, since I 
can't remember how it works....)
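
A rough sketch of that ordering, in case it helps.  This is not the 
SWISH::Filter code itself: the hash fields, the numbers, and the 
"type first, then priority, lowest first" rule are assumptions made up 
for illustration, and the "continue" flag only stands in for what 
set_continue does.

```perl
use strict;
use warnings;

# Hypothetical filter descriptions; the values are illustrative only.
my @filters = (
    { name => 'XLtoHTML', type => 2, priority => 50, handles => 1, continue => 0 },
    { name => 'xls2csv',  type => 2, priority => 40, handles => 1, continue => 1 },
    { name => 'example_type1', type => 1, priority => 50, handles => 0, continue => 0 },
);

# Assumed ordering: lower type first, then lower priority within a type.
my @ordered = sort {
    $a->{type} <=> $b->{type} || $a->{priority} <=> $b->{priority}
} @filters;

# Walk the chain: stop after the first filter that handles the document,
# unless that filter asked for the chain to continue.
my @ran;
for my $f (@ordered) {
    next unless $f->{handles};
    push @ran, $f->{name};
    last unless $f->{continue};
}

print 'order: ', join( ', ', map { $_->{name} } @ordered ), "\n";
print 'ran:   ', join( ', ', @ran ), "\n";
```

With these made-up values the second filter only runs because the 
first one that handled the document set its continue flag.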

>And how do I specify that the output of xls2csv should be
> parsed by the TXT2 parser?

Swish normally chooses the parser by mapping file extensions to it.  
That's not a very good way to go, of course.  Someday I'll add 
processing by content-type internal to swish (or that's been the plan 
for a while).  But if you are using -S prog you can set the parser in a 
header.

I see this in spider.pl:

    # Set the parser type if specified by filtering
    if ( my $type = delete $server->{parser_type} ) {
        $headers .= "Document-Type: $type\n";

    } elsif ( $response->content_type =~ m!^text/(html|xml|plain)! ) {
        $type = $1 eq 'plain' ? 'txt' : $1;
        $headers .= "Document-Type: $type*\n";
    }

So it's setting a Document-Type: header to select the parser.
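
For illustration, here is roughly what one document record from a 
-S prog program might look like with that header set.  format_record() 
is a hypothetical helper, not something in swish-e or spider.pl, and 
the path, the content, and the "TXT2" parser name (taken from your 
question) are only example values.

```perl
use strict;
use warnings;

# Build one document record of the kind a -S prog program prints to STDOUT.
# Hypothetical helper for illustration; not part of swish-e or spider.pl.
sub format_record {
    my ( $path, $content, $parser ) = @_;

    # Assumes $content is a byte string, so length() is the byte count.
    my $headers = "Path-Name: $path\n"
                . 'Content-Length: ' . length($content) . "\n";

    # Optional header that selects the parser for this document.
    $headers .= "Document-Type: $parser\n" if defined $parser;

    # A blank line separates the headers from the document body.
    return $headers . "\n" . $content;
}

my $csv = "name,qty\nwidget,3\n";
print format_record( 'report.xls', $csv, 'TXT2' );
```

If the parser argument is left out, no Document-Type: header is 
written and the choice falls back to whatever swish would do on its own.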

Does that help?

-- 
Bill Moseley
moseley@hank.org
Received on Fri Jul 25 19:52:29 2003