On Wed, May 07, 2003 at 11:31:09AM -0700, Jean Mao wrote:
> Hello, I was trying to index pdf files on our webserver but failed.
"Failed" is only one notch above "doesn't work" for descriptive terms... ;)
> swish-e -c biowulf.conf -S prog -v 0 -f biowulf.index
>
> the biowulf.conf I used looks like this:
>
> IndexDir ./prog-bin/spider.pl
> # Tell the spider what to index.
> ReplaceRules remove "http://"
> SwishProgParameters default http://biowulf.nih.gov
The "default" setting for the spider only fetches text/html and text/plain documents. For
example, using "default" with a PDF I get a message like:
http://localhost/test.pdf application/pdf != (text/html text/plain)
I do not know if that's your problem or not.
Now, trying with a PDF at your site I get something different:
moseley@bumby:~$ swish-e -c f.conf -T indexed_words -S prog
Indexing Data Source: "External-Program"
Indexing "/home/moseley/swish-e/prog-bin/spider.pl"
/home/moseley/swish-e/prog-bin/spider.pl: Reading parameters from 'default'
Summary for: http://biowulf.nih.gov/pbsdoc/pbs_user_guide.pdf
Unique URLs: 1 (1.0/sec)
Removing very common words...
no words removed.
Writing main index...
err: No unique words indexed!
.
First let's see if I can fetch the document:
moseley@bumby:~$ wget http://biowulf.nih.gov/pbsdoc/pbs_user_guide.pdf
--12:58:18-- http://biowulf.nih.gov/pbsdoc/pbs_user_guide.pdf
=> `pbs_user_guide.pdf'
Resolving biowulf.nih.gov... done.
Connecting to biowulf.nih.gov[128.231.2.11]:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
12:58:18 ERROR 403: Forbidden.
Ignoring that problem, what you need is a spider configuration that knows what to do with
PDF files. In the prog-bin directory is SwishSpiderConfig.pl. That has examples of what
you can do.
(This will be easier in the next release of swish.)
Here's a complete config you can modify. The command I'm using is:
$ swish-e -c f.conf -S prog
f.conf
------
$ cat f.conf
IndexDir /home/moseley/swish-e/prog-bin/spider.pl
ReplaceRules remove "http://"
SwishProgParameters spider.conf
IndexContents HTML* .html .htm .pdf
DefaultContents HTML*
StoreDescription HTML* <body> 200000
MetaNames swishdocpath swishtitle
spider.conf
----------
This is basically just a trimmed-down version of the example in SwishSpiderConfig.pl.
$ cat spider.conf
# so Perl can find the pdf2html and doc2txt modules
use lib '/home/moseley/swish-e/prog-bin';
@servers = (
    {
        base_url   => 'http://localhost/apache/verhey.pdf',
        agent      => 'swish-e spider http://swish-e.org/',
        email      => 'spider@hank.org',
        delay_min  => .0001,
        keep_alive => 1,    # enable keep-alive requests

        # Skip images. A hash holds only one test_url entry (a later one
        # silently replaces an earlier one), so do not also add a
        # "limit to only .html files" test here or PDFs are never fetched.
        test_url => sub { $_[0]->path !~ /\.(?:gif|jpeg)$/ },

        # Only accept content types we can index or convert.
        test_response => sub {
            my $content_type = $_[2]->content_type;
            my $ok = grep { $_ eq $content_type }
                qw{ text/html text/plain application/pdf application/msword };
            return 1 if $ok;
            print STDERR "$_[0] wrong content type ( $content_type )\n";
            return;
        },

        filter_content => [ \&pdf, \&doc ],
    },
);
use pdf2html;    # included example pdf converter module
sub pdf {
    my ( $uri, $server, $response, $content_ref ) = @_;
    return 1 unless $response->content_type eq 'application/pdf';
    # for logging counts
    $server->{counts}{'PDF transformed'}++;
    $$content_ref = ${pdf2html( $content_ref, 'title' )};
    $$content_ref =~ tr/ / /s;
    return 1;
}
use doc2txt;    # included example MS Word converter module
sub doc {
    my ( $uri, $server, $response, $content_ref ) = @_;
    return 1 unless $response->content_type eq 'application/msword';
    # for logging counts
    $server->{counts}{'DOC transformed'}++;
    $$content_ref = ${doc2txt( $content_ref )};
    return 1;
}
# Must return true...
1;
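
If it helps to see how the filter_content callbacks fit together, here is a minimal
standalone sketch you can run without swish-e or a web server. It is not the spider's
actual code: MockResponse, fake_pdf2html and run_filters are hypothetical stand-ins I made
up for illustration, and the "stop when a filter returns false" behaviour in run_filters
is an assumption. Only the pdf() sub has the same shape as the one in spider.conf above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the HTTP response object the spider passes in;
# only content_type() is needed here.
package MockResponse;
sub new          { my ( $class, $type ) = @_; return bless { type => $type }, $class }
sub content_type { return $_[0]{type} }

package main;

# Stub converter (assumption): the real pdf2html module is not reproduced
# here; this one just wraps the content so the flow is visible.
sub fake_pdf2html {
    my ($content_ref) = @_;
    my $html = "<html><body>$$content_ref</body></html>";
    return \$html;
}

# Same shape as the pdf() filter in spider.conf.
sub pdf {
    my ( $uri, $server, $response, $content_ref ) = @_;
    return 1 unless $response->content_type eq 'application/pdf';
    $server->{counts}{'PDF transformed'}++;
    $$content_ref = ${ fake_pdf2html($content_ref) };
    $$content_ref =~ tr/ / /s;    # squeeze runs of spaces
    return 1;
}

# Hypothetical driver: call each filter in order, stop if one returns false.
sub run_filters {
    my ( $uri, $server, $response, $content_ref ) = @_;
    for my $filter ( @{ $server->{filter_content} } ) {
        return 0 unless $filter->( $uri, $server, $response, $content_ref );
    }
    return 1;
}

my $server = { filter_content => [ \&pdf ], counts => {} };

my $pdf_body  = 'extracted   pdf    text';
my $html_body = '<p>already   html</p>';

run_filters( 'http://localhost/a.pdf',  $server,
    MockResponse->new('application/pdf'), \$pdf_body );
run_filters( 'http://localhost/b.html', $server,
    MockResponse->new('text/html'), \$html_body );

print "$pdf_body\n";
print "$html_body\n";
print "PDF transformed: $server->{counts}{'PDF transformed'}\n";
```

The point is that each filter gets a reference to the content and rewrites it in place
only when the content type matches; anything else passes through untouched.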
--
Bill Moseley
moseley@hank.org
Received on Wed May 7 20:27:47 2003