Re: index pdf files with spider.pl

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed May 07 2003 - 20:27:08 GMT
On Wed, May 07, 2003 at 11:31:09AM -0700, Jean Mao wrote:
> Hello, I was trying to index pdf files on our webserver but failed.

"Failed" is only one notch above "doesn't work" for descriptive terms... ;)

> swish-e -c biowulf.conf -S prog -v 0 -f biowulf.index
> 
> the biowulf.conf I used looks like this:
> 
> IndexDir ./prog-bin/spider.pl
> # Tell the spider what to index.
> ReplaceRules remove "http://"

> SwishProgParameters default http://biowulf.nih.gov

The "default" setting for the spider only fetches text/html and text/plain documents.  For
example, using "default" with a PDF I get a message like:

  http://localhost/test.pdf application/pdf != (text/html text/plain)

I do not know if that's your problem or not.
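To illustrate the kind of check behind that message, here is a minimal sketch in plain Perl. It is not the code from spider.pl; the helper name type_ok is made up for the example, and the list of accepted types is simply copied from the message above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Content types accepted in the "default" setup, as listed in the
# message above.
my @default_types = qw( text/html text/plain );

# Hypothetical helper (not part of spider.pl): returns a true value
# when $type appears in the accepted list.
sub type_ok {
    my ( $type, @accepted ) = @_;
    return scalar( grep { $_ eq $type } @accepted );
}

# Report what would happen to an HTML page and to a PDF.
for my $type (qw( text/html application/pdf )) {
    my $verdict = type_ok( $type, @default_types ) ? 'fetched' : 'skipped';
    printf "%-16s %s\n", $type, $verdict;
}
```

With only those two types in the list, application/pdf never matches, so a PDF is skipped before any indexing happens.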

Now, trying with a PDF at your site I get something different:

moseley@bumby:~$ swish-e -c f.conf  -T indexed_words -S prog
Indexing Data Source: "External-Program"
Indexing "/home/moseley/swish-e/prog-bin/spider.pl"
/home/moseley/swish-e/prog-bin/spider.pl: Reading parameters from 'default'

Summary for: http://biowulf.nih.gov/pbsdoc/pbs_user_guide.pdf
Unique URLs: 1  (1.0/sec)
Removing very common words...
no words removed.
Writing main index...
err: No unique words indexed!
.

First let's see if I can fetch the document:

moseley@bumby:~$ wget http://biowulf.nih.gov/pbsdoc/pbs_user_guide.pdf
--12:58:18--  http://biowulf.nih.gov/pbsdoc/pbs_user_guide.pdf
           => `pbs_user_guide.pdf'
Resolving biowulf.nih.gov... done.
Connecting to biowulf.nih.gov[128.231.2.11]:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
12:58:18 ERROR 403: Forbidden.

Ignoring that problem, what you need is a spider configuration that knows what to do with 
PDF files.  In the prog-bin directory is SwishSpiderConfig.pl.  That has examples of what 
you can do.

(This will be easier in the next release of swish.)

Here's a complete config you can modify.  The command I'm using is:

  $ swish-e -c f.conf -S prog

f.conf
------

$ cat f.conf

IndexDir /home/moseley/swish-e/prog-bin/spider.pl

ReplaceRules remove "http://"

SwishProgParameters spider.conf

IndexContents HTML* .html .htm .pdf
DefaultContents HTML*
StoreDescription HTML* <body> 200000
MetaNames swishdocpath swishtitle

spider.conf
----------

This is basically just a trimmed-down version of the example in SwishSpiderConfig.pl.

$ cat spider.conf

# so Perl can find the pdf2html and doc2txt modules

use lib '/home/moseley/swish-e/prog-bin';

@servers = (

    {
        base_url    => 'http://localhost/apache/verhey.pdf',
        agent       => 'swish-e spider http://swish-e.org/',
        email       => 'spider@hank.org',

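        # Skip images; everything else is fetched and then checked by
        # test_response below.  (A test_url that only matches .html files
        # would keep the spider from ever fetching PDFs.)
        test_url    => sub { $_[0]->path !~ /\.(?:gif|jpeg)$/ },

        delay_min   => .0001,
        keep_alive  => 1,         # enable keep-alive requests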

        test_response   => sub {
            my $content_type = $_[2]->content_type;
            my $ok = grep { $_ eq $content_type } qw{ text/html text/plain application/pdf  application/msword };
            return 1 if $ok;

            print STDERR "$_[0] wrong content type ( $content_type )\n";
            return;
        },

        filter_content  => [ \&pdf, \&doc ],
    },
);    
    

use pdf2html;  # included example pdf converter module
sub pdf {
   my ( $uri, $server, $response, $content_ref ) = @_;

   return 1 unless $response->content_type eq 'application/pdf';

   # for logging counts
   $server->{counts}{'PDF transformed'}++;

   $$content_ref = ${pdf2html( $content_ref, 'title' )};
   $$content_ref =~ tr/ / /s;
   return 1;
}

use doc2txt;  # included example MS Word converter module

sub doc {
   my ( $uri, $server, $response, $content_ref ) = @_;

   return 1 unless $response->content_type eq 'application/msword';

   # for logging counts
   $server->{counts}{'DOC transformed'}++;

   $$content_ref = ${doc2txt( $content_ref )};
   return 1;
}

# Must return true...

1;
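
If you want to check a filter callback on its own before running the spider, something like the following may help. It is only a sketch: StubResponse and fake_pdf are names invented for this example, the "conversion" is a placeholder rather than a real pdf2html call, and it does not reproduce how spider.pl itself invokes the filters. It only shows the argument list used above and that the content is changed in place through the reference.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for the response object; only content_type() is needed here.
package StubResponse;
sub new          { my ( $class, $type ) = @_; return bless { type => $type }, $class }
sub content_type { return $_[0]{type} }

package main;

# Same shape as the pdf() filter in spider.conf, but with a placeholder
# conversion instead of pdf2html (assumption made for this example).
sub fake_pdf {
    my ( $uri, $server, $response, $content_ref ) = @_;

    # Leave anything that is not a PDF untouched.
    return 1 unless $response->content_type eq 'application/pdf';

    # for logging counts
    $server->{counts}{'PDF transformed'}++;

    # Replace the content in place through the reference.
    $$content_ref = '<html><body>converted '
        . length($$content_ref)
        . ' bytes</body></html>';
    return 1;
}

my %server;
for my $type (qw( text/html application/pdf )) {
    my $content = '%PDF-1.4 dummy';
    fake_pdf( 'http://localhost/test', \%server, StubResponse->new($type), \$content );
    print "$type => $content\n";
}
```

The text/html pass should leave the content alone, while the application/pdf pass should rewrite it and bump the 'PDF transformed' count.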




-- 
Bill Moseley
moseley@hank.org
Received on Wed May 7 20:27:47 2003