Hi,

I'm having trouble getting swish-e to index PDF files using spider.pl and am at a loss as to where to look next. I've looked on the swish-e site but haven't found anything that helps with this problem. I'm using pdftotext to do the conversion. I have successfully got swish-e to index a single PDF file using the -S fs and -S http options, but I can't figure out why it won't work when crawling the web server with -S prog. Can anyone shed any light on what I might be doing wrong?

Any help much appreciated. Thanks.

Rosalyn.

The output I get is...
prismweb@hermes> /usr/local/bin/swish-e -c swish.conf -S prog
Indexing Data Source: "External-Program"
Indexing "/usr/local/lib/swish-e/spider.pl"
External Program found: /usr/local/lib/swish-e/spider.pl
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'spider_prism.config'
Summary for: http://prism.enes.org/Publications/Reports/Report05.pdf
Connection: Close: 1 (1.0/sec)
Total Bytes: 72,475 (72475.0/sec)
Total Docs: 1 (1.0/sec)
Unique URLs: 1 (1.0/sec)
application/pdf->text/html: 1 (1.0/sec)
Error: May not be a PDF file (continuing anyway)
Error (0): PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
http://prism.enes.org/Publications/Reports/Report05.pdf - Using HTML2 parser - (no words indexed)
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 8 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
8 unique words indexed.
5 properties sorted.
1 file indexed. 72,475 total bytes. 8 total words.
Elapsed time: 00:00:02 CPU time: 00:00:00
Indexing done!
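
As far as I know, "May not be a PDF file" is what pdftotext prints when it cannot find a %PDF- marker near the start of its input, so one thing worth checking is whether the data reaching the filter is still raw PDF. A minimal sketch, assuming the content in question has been saved to a local file; the function name looks_like_pdf is made up for illustration and is not part of swish-e or xpdf:

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the first 1024 bytes
# of the given file contain the "%PDF-" marker, fail otherwise.
looks_like_pdf() {
    head -c 1024 "$1" | LC_ALL=C grep -q '%PDF-'
}

# Example use (the path is whatever file you want to inspect):
#   if looks_like_pdf /path/to/file; then echo "PDF header found"; fi
```

If a file that should be a PDF fails this check, the input was probably converted or altered before pdftotext saw it.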
**My swish.conf contains...
prismweb@hermes> more swish.conf
# Administrative Directives
IndexName "PRISM Site Index"
IndexDescription "This is a swish index of the PRISM web site!"
IndexAdmin "R.S.Hatcher <r.s.hatcher@rdg.ac.uk>"
IndexFile /export/hermes/hermes-01/apache/htdocs/htdocs-prism/live/search/swish_files/prism.index
ReplaceRules replace /export/hermes/hermes-01/apache/htdocs/htdocs-prism/ http://prism.enes.org/
# Use spider.pl as the external program:
IndexDir /usr/local/lib/swish-e/spider.pl
# now make the specific configuration file for the spider.pl - all those
# files you don't want spidered in prism
SwishProgParameters spider_prism.config
obeyRobotsNoIndex yes
IndexReport 3
# This is how detailed you want reporting. You can specify numbers
# 0 to 3 - 0 is totally silent, 3 is the most verbose.
IndexContents HTML* .htm .html .php
IndexContents TXT* .txt
# Otherwise, use the HTML parser
DefaultContents HTML*
NoContents .gif .jpg .jpeg .ps
FileFilter .pdf pdftotext "'%p' -"
IgnoreTotalWordCountWhenRanking yes
ConvertHTMLEntities yes
# Allow extra searching by title, path, description
Metanames swishtitle swishdocpath swishdescription
# Set StoreDescription for each parser
# to display context with search results
StoreDescription TXT* 10000
StoreDescription HTML* <body> 10000
**and the spider_prism.config contains...
@servers = (
    {
        # base_url => 'http://prism.enes.org/index.php',
        base_url => 'http://prism.enes.org/Publications/Reports/Report05.pdf',
        same_hosts => 'www.prism.enes.org',
        email => 'r.s.hatcher@rdg.ac.uk',
        use_default_config => 1,
        use_md5 => 1,    # If true, this will use the Digest::MD5
                         # module to create checksums on content.
                         # This will very likely catch files
                         # with different URLs that are the same
                         # content. Will trap / and /index.html,
                         # for example.
        delay_sec => 0,  # Delay in seconds between requests
        remove_leading_dots => 1,
        keep_alive => 1, # Try to keep the connection open
        test_url => \&test_url,
    },
);
1;
sub test_url {
    use URI::QueryParam;
    my $uri = shift;
    # if sort_order is in the URL then don't return it
    my $id = $uri->query_param('sort_order');
    return 0 if $id && $id =~ /ASC|DESC/;
    return 0 if $uri->path =~ /_inc|Connections/;
    return 0 if $uri->path =~ /Images|css|Templates/;
    return 0 if $uri->path =~ /graph|admin|make_pdf/;
    return 0 if $uri->path =~ /Internal/;
    return 0 if $uri->path =~ /\.(xml|old|css)?$/;
    return 0 if $uri->path =~ /Documentation/;
    return 1 if $uri->path =~ /\.(html|htm|php|pdf)?$/;
}
1;
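
To rule out the URL filter itself, the path checks in test_url can be approximated from the command line. This is only a rough sketch: it copies the regular expressions above into grep -E, leaves out the sort_order query-parameter check, and the function name path_accepted is hypothetical. It is not how spider.pl evaluates the callback.

```shell
#!/bin/sh
# Rough shell approximation of the path checks in test_url.
# Exit status 0 means the path would be accepted, 1 means rejected.
path_accepted() {
    path=$1
    # Exclusion patterns, copied from the "return 0" lines.
    if printf '%s\n' "$path" | grep -Eq '_inc|Connections|Images|css|Templates|graph|admin|make_pdf|Internal|Documentation'; then
        return 1
    fi
    if printf '%s\n' "$path" | grep -Eq '\.(xml|old|css)?$'; then
        return 1
    fi
    # Acceptance pattern, copied from the final "return 1" line.
    printf '%s\n' "$path" | grep -Eq '\.(html|htm|php|pdf)?$'
}

# Example use:
#   path_accepted /Publications/Reports/Report05.pdf && echo accepted
```

Under these assumptions the Report05.pdf path is accepted, which is consistent with the spider fetching it in the output above.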
**Using http method:
prismweb@hermes> local/bin/swish-e -c swish.conf -S http -i http://www.prism.enes.org/Publications/Reports/Report05.pdf
Indexing Data Source: "HTTP-Crawler"
Indexing "http://www.prism.enes.org/Publications/Reports/Report05.pdf"
Now fetching [http://www.prism.enes.org/robots.txt]...Status: 404.
retrieving http://www.prism.enes.org/Publications/Reports/Report05.pdf (0)...
sleeping 5 seconds before fetching http://www.prism.enes.org/Publications/Reports/Report05.pdf
Now fetching [http://www.prism.enes.org/Publications/Reports/Report05.pdf]...Status: 200. application/pdf
- Using HTML2 parser - (11861 words)
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 1,337 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
1,337 unique words indexed.
5 properties sorted.
1 file indexed. 433,766 total bytes. 11,870 total words.
Elapsed time: 00:00:08 CPU time: 00:00:00
Indexing done!
Sure enough, the index has been created OK:
prismweb@hermes> /usr/local/bin/swish-e -w SRE -f prism.index
# SWISH format: 2.4.3
# Search words: SRE
# Removed stopwords:
# Number of hits: 1
# Search time: 0.001 seconds
# Run time: 0.031 seconds
1000 http://www.prism.enes.org/Publications/Reports/Report05.pdf "Report05.pdf" 433766
.
prismweb@hermes>
I get similar results for:
prismweb@hermes> local/bin/swish-e -c swish.conf -S fs -i /home/prismweb/live/Publications/Reports/Report05.pdf
--
------------------------------------------------------------------------
Rosalyn Hatcher
CGAM, Dept. of Meteorology, University of Reading,
Earley Gate, Reading. RG6 6BB
Email: r.s.hatcher@reading.ac.uk Tel: +44 (0) 118 378 7841
Received on Fri Jan 6 03:49:00 2006