Indexing pdf files

From: David Cogley <david(at)not-real.cogley.com>
Date: Tue Jan 28 2003 - 22:33:29 GMT
I'm having difficulty indexing pdf files.  Swish-e creates a large index,
but its contents seem to be garbage: running "strings" on gimppdr.index
shows none of the terms I expected.

I've set up and tested swish-e for indexing html files.  I implemented a
cgi-script for searching the collection of html files and everything works
as expected.

Following the swish-e documentation, I've tried to implement a FileFilter
directive to allow for indexing of pdf files.  I've entered these lines in
the configuration file:

     IndexOnly     .pdf
     FileFilter    .pdf   /home/david/bin/_pdf2html.pl
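
For illustration only, here is a minimal sketch of what a filter for this
directive might do, written in Python rather than as the actual
_pdf2html.pl (whose contents are not shown here).  It assumes that the
indexer passes the path of the file to read as the first argument and
reads the converted text from the filter's standard output; both are
assumptions worth checking against the swish-e documentation for the
installed version.  The names build_command and extract_text are
hypothetical.

```python
import subprocess


def build_command(pdf_path):
    # Passing "-" as the output name makes pdftotext write to stdout.
    # Without it, pdftotext creates a .txt file next to the PDF and
    # prints nothing, so a filter would hand back empty output.
    return ["pdftotext", pdf_path, "-"]


def extract_text(pdf_path):
    # Run pdftotext and return the extracted text as bytes.
    # check=True raises CalledProcessError if pdftotext fails.
    result = subprocess.run(
        build_command(pdf_path), stdout=subprocess.PIPE, check=True
    )
    return result.stdout
```

A filter script built on this would call extract_text(sys.argv[1]) and
write the result to sys.stdout.buffer.  Running the filter by hand on
user-guide.pdf and checking that readable text appears on the terminal is
a quick way to rule the filter in or out as the cause.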

To minimize any problems with non-standard pdf files, I've run tests on the
gimp user-guide.pdf.  I've created a text file of the pdf contents with:

     pdftotext user-guide.pdf

This allows me to check the text inside the pdf file.  When I run

     swish-e -c gimppdr.conf

I get gimppdr.index, which is 402896 bytes.  When I run "strings
gimppdr.index > gimppdr_index.log", the resulting gimppdr_index.log is only
595 bytes.  If I run the cgi script against gimppdr.index, I get no hits
for the terms in user-guide.txt.
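
The manual "strings" comparison above can be automated along these lines.
This is only a rough sketch: printable_runs approximates what "strings"
does by default (runs of at least four printable ASCII characters), and
missing_terms is a hypothetical helper that lists words from the extracted
text that do not appear in those runs.  Whether swish-e stores words in
its index as plain text at all is not something this check can confirm.

```python
import re


def printable_runs(data, min_len=4):
    # Find runs of printable ASCII bytes at least min_len long,
    # roughly like the default behaviour of the "strings" utility.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.decode("ascii") for m in pattern.findall(data)]


def missing_terms(expected_text, data, min_len=4):
    # Words of min_len or more letters from expected_text that do not
    # occur anywhere in the printable runs of data (case-insensitive).
    haystack = " ".join(printable_runs(data, min_len)).lower()
    words = set(re.findall(r"[a-z]{%d,}" % min_len, expected_text.lower()))
    return sorted(w for w in words if w not in haystack)
```

For example, the two inputs could be the contents of user-guide.txt and
the bytes of gimppdr.index read with open(..., "rb").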

What am I doing wrong?  I'm running Red Hat 8.0 Professional.

David Cogley
Received on Tue Jan 28 22:33:57 2003