Re: index pdf files with spider.pl

From: Erik Lyons <ELyons(at)not-real.mail.open.org>
Date: Tue Jul 22 2003 - 23:38:31 GMT
After several weeks of exclaiming joyful praise to the initial "S" in
SWISH, I stumbled across the example quoted below. It runs and reports
"PDF transformed:      2,009  (19.7/sec)", but no PDF files are
returned in any search results. As an added bonus, all document titles
in the search results appear as "(NULL)". Are these problems
related, or do I have two different gleaming horizons of delight to
explore?

;-)

--
-e.l.


On Wed, 7 May 2003 13:27:27 -0700 (PDT), Bill Monroe wrote:
..
> 
> (This will be easier in the next release of swish.)
> 
> 
> Here's a complete config you can modify. The command I'm using is:
> 
> 
> $ swish-e -c f.conf -S prog
> 
> 
> f.conf
> ------
> 
> 
> $ cat f.conf
> 
> 
> IndexDir /home/moseley/swish-e/prog-bin/spider.pl
> 
> 
> ReplaceRules remove "http://"
> 
> 
> SwishProgParameters spider.conf
> 
> 
> IndexContents HTML* .html .htm .pdf
> DefaultContents HTML*
> StoreDescription HTML* <body> 200000
> MetaNames swishdocpath swishtitle
> 
> 
> spider.conf
> ----------
> 
> 
> This is basically just a trimmed-down version of the example in
> SwishSpiderConfig.pl
> 
> 
> $ cat spider.conf
> 
> 
> # so can find the pdf2html and doc2txt modules
> 
> 
> use lib '/home/moseley/swish-e/prog-bin';
> 
> 
> @servers = (
> 
> 
> {
> base_url => 'http://localhost/apache/verhey.pdf',
> agent => 'swish-e spider http://swish-e.org/',
> email => 'spider@hank.org',
> 
> 
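> # limit to only .html files
> # (note: test_url is set again further down; in a Perl hash the later
> # entry replaces this one, so this .html-only test has no effect)
> test_url => sub { $_[0]->path =~ /\.html?$/ },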
> 
> 
> delay_min => .0001,
> keep_alive => 1, # enable keep-alive requests
> 
> 
> test_url => sub { $_[0]->path !~ /\.(?:gif|jpeg)$/ },
> 
> 
> test_response => sub {
> my $content_type = $_[2]->content_type;
> my $ok = grep { $_ eq $content_type } qw{ text/html text/plain
>     application/pdf application/msword };
> return 1 if $ok;
> 
> 
> print STDERR "$_[0] wrong content type ( $content_type )\n";
> return;
> },
> 
> 
> filter_content => [ \&pdf, \&doc ],
> },
> ); 
> 
> 
> 
> use pdf2html; # included example pdf converter module
> sub pdf {
> my ( $uri, $server, $response, $content_ref ) = @_;
> 
> 
> return 1 unless $response->content_type eq 'application/pdf';
> 
> 
> # for logging counts
> $server->{counts}{'PDF transformed'}++;
> 
> 
> $$content_ref = ${pdf2html( $content_ref, 'title' )};
> $$content_ref =~ tr/ / /s;
> return 1;
> }
> 
> 
> use doc2txt; # included example MS Word converter module
> 
> 
> sub doc {
> my ( $uri, $server, $response, $content_ref ) = @_;
> 
> 
> return 1 unless $response->content_type eq 'application/msword';
> 
> 
> # for logging counts
> $server->{counts}{'DOC transformed'}++;
> 
> 
> $$content_ref = ${doc2txt( $content_ref )};
> return 1;
> }
> 
> 
> # Must return true...
> 
> 
> 1;
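
One way to narrow down the "(NULL)" titles is to look at what the spider hands to swish-e before it is indexed. The following is a minimal sketch, not a confirmed diagnosis: it assumes the spider's output has been saved to a file by running it by hand (for example `./spider.pl spider.conf > spider.out`, which as far as I know spider.pl supports). The file name `spider.out`, the helper name `count_titles`, and the sample record are all hypothetical and only there to make the example self-contained.

```shell
#!/bin/sh
# Hypothetical helper: count the lines in a saved spider output file
# that contain a <title> tag. A count of zero for converted PDFs would
# suggest that the filter output carries no title for swish-e to store.
count_titles() {
    # -c prints the number of matching lines, -i ignores case
    grep -ci '<title>' "$1"
}

# Demonstration on a made-up record. The header names follow the
# "-S prog" input format as I understand it; the values are invented.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Path-Name: http://localhost/apache/example.pdf
Content-Length: 66
Document-Type: HTML*

<html><head><title>Example</title></head><body>text</body></html>
EOF

count_titles "$sample"
rm -f "$sample"
```

If real spider output shows title tags but the index still reports "(NULL)", the problem is more likely on the swish-e config side than in the PDF filter.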
Received on Tue Jul 22 23:38:42 2003