I'm having difficulty indexing pdf files. I create a large index, but
it seems to be garbage: "strings gimppdr.index" shows none of the terms I expected.
I've set up and tested swish-e for indexing html files. I implemented a
cgi-script for searching the collection of html files and everything works
as expected.
Following the swish-e documentation, I've tried to implement a FileFilter
directive to allow for indexing of pdf files. I've entered these lines in
the configuration file:
IndexOnly .pdf
FileFilter .pdf /home/david/bin/_pdf2html.pl
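One check that can be made independently of swish-e is to run the filter by hand and confirm that it writes something to standard output. A minimal sketch in Python follows. It assumes the filter accepts the pdf path as its first argument, which is only an assumption, since the arguments swish-e actually passes to a FileFilter program depend on the version. The helper name filter_output_size is hypothetical.

```python
import subprocess


def filter_output_size(cmd):
    """Run a filter command and return the number of bytes it wrote to stdout.

    cmd is a list such as ["/home/david/bin/_pdf2html.pl", "user-guide.pdf"].
    The argument convention is an assumption, not taken from the swish-e docs.
    """
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=False,  # a failing filter should be reported, not raise
    )
    if result.returncode != 0:
        # Show the filter's own error text, if any, to help diagnose it.
        print("filter exited with status", result.returncode)
        print(result.stderr.decode("utf-8", errors="replace"))
    return len(result.stdout)


# Example use (paths taken from the configuration above):
#   size = filter_output_size(["/home/david/bin/_pdf2html.pl", "user-guide.pdf"])
#   print("filter wrote", size, "bytes")
```

If this reports zero bytes, the problem would be in the filter itself rather than in the indexing step.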
To minimize any problems with non-standard pdf files, I've run tests on the
gimp user-guide.pdf. I've created a text file of the pdf contents with:
pdftotext user-guide.pdf
This allows me to check the text inside the pdf file. When I run
swish-e -c gimppdr.conf
I get gimppdr.index, which is 402896 bytes. When I "strings gimppdr.index >
gimppdr_index.log", I get gimppdr_index.log, which is only 595 bytes. If I run
the cgi script against gimppdr.index, I get no hits for the terms in
user-guide.txt.
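The "strings" comparison described above can be sketched in Python as follows. This is only a rough stand-in for strings(1): it treats runs of four or more printable ASCII bytes as a string, and it assumes that indexed words would appear in the index file as plain text, which I have not confirmed for swish-e. The function names are hypothetical.

```python
import re

# Runs of at least four printable ASCII characters, roughly what strings(1) reports.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def printable_strings(data):
    """Return the printable ASCII runs found in a bytes object."""
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)]


def missing_terms(index_bytes, terms):
    """Return the terms that do not occur (case-insensitively) in the index."""
    haystack = "\n".join(printable_strings(index_bytes)).lower()
    return [t for t in terms if t.lower() not in haystack]


# Example use (file name from this post; the term list is illustrative only):
#   with open("gimppdr.index", "rb") as f:
#       data = f.read()
#   print(missing_terms(data, ["gimp", "layer"]))
```

A term reported as missing here only means it is not visible as plain text in the file; it does not by itself prove the term was not indexed.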
What am I doing wrong? I'm running Red Hat 8.0 Professional.
David Cogley
Received on Tue Jan 28 22:33:57 2003