I'm having problems getting the PDF filter to work.
I have SWISH-E 2.0 running on Red Hat Linux 6.2.
I'm creating an index using -S http to spider
a single .pdf file (just to test the filter).
From my pdftest.config file:
FileFilter .pdf pdf-filter.sh
FileFilter .doc doc-filter.sh
and from my pdf-filter.sh:
# Adobe PDF filter
# see: http://www.foolabs.com/xpdf/
/usr/bin/pdftotext "$1" - 2>/dev/null
/usr/bin/pdftotext "$1" - >/tmp/gersee
When I run the indexer, it only indexes 2 words:
[swish@listserv adamtest]$ ../src/swish-e -S http -c pdftest.config
Indexing Data Source: "HTTP-Crawler"
retrieving http://www.arb.ca.gov/msprog/spillcon/wdec00.pdf (0)...
Removing very common words...
no words removed.
Writing main index...
Computing hash table ...
Writing header ...
Writing index entries ...
Writing stopwords ...
2 unique words indexed.
Writing file index...
Writing file list ...
Writing file offsets ...
Writing MetaNames ...
Writing offsets (2)...
1 file indexed.
Running time: 1 second.
At this point, the /tmp/gersee file that is written
by the last line of pdf-filter.sh does contain the text
conversion of the PDF file, but for some reason the
STDOUT of the pdftotext filter isn't making it back
into the swish indexing.
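To narrow this down, a debugging variant of the filter could log what the indexer actually passes in and keep pdftotext's error output instead of discarding it. The following is only a sketch, not part of the SWISH-E distribution: the function name run_filter and the PDF_CONVERTER, FILTER_LOG and FILTER_COPY variables are made up for illustration, and it assumes the filter is called with the file to convert as its first argument, as pdf-filter.sh above already does.

```shell
#!/bin/sh
# Hypothetical debugging variant of pdf-filter.sh (illustrative names throughout).

run_filter() {
    # Defaults mirror the paths used in the original filter; override via environment.
    converter="${PDF_CONVERTER:-/usr/bin/pdftotext}"
    log="${FILTER_LOG:-/tmp/pdf-filter.log}"
    copy="${FILTER_COPY:-/tmp/gersee}"

    # Record how many arguments the indexer passed and what they were.
    echo "argc=$# argv=$*" >>"$log"

    # Run the converter once; tee keeps a copy while still sending the text to stdout.
    # Errors are appended to the log rather than thrown away.
    "$converter" "$1" - 2>>"$log" | tee "$copy"
}

# Only run when called with arguments, so the file can also be sourced for testing.
if [ "$#" -gt 0 ]; then
    run_filter "$@"
fi
```

Unlike the original script, this runs the converter a single time, so the text seen on STDOUT and the copy in /tmp/gersee come from the same invocation. After an indexing run, the log shows which arguments the filter received and any error messages from pdftotext.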
Can anyone tell me what I'm missing here?
Received on Thu Sep 20 22:02:49 2001