
Re: indexing PDF

From: Patrick Fitzgerald <fitz(at)>
Date: Wed Aug 12 1998 - 00:28:15 GMT
Rainer Scherg RTC wrote:
>> Could you describe the code changes?
>Starting a filter program as a child process. The output of the filter prog
>will be piped to the swish-e process. [...]
>The filter prog is very simple:
>pdftotext "$1" - 2>/dev/null
>... using the pdftotext utility from the xpdf package.

Thanks for the pointer, I didn't know such a beast existed.
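
For anyone wiring this up, I'd guess the filter wrapper looks something
like the sketch below.  This is my sketch, not Rainer's actual code:
only the pdftotext line comes from his mail, and the case dispatch on
file extension is my own assumption about how other formats might be
handled.

  #!/bin/sh
  # Sketch of a swish-e filter wrapper, assuming the child process
  # receives the file name as $1 and must write plain text to STDOUT.
  case "$1" in
      *.pdf)
          # xpdf's pdftotext; "-" sends the text to STDOUT,
          # warnings go to /dev/null.
          pdftotext "$1" - 2>/dev/null
          ;;
      *)
          # Anything else: pass the contents through untouched.
          cat "$1"
          ;;
  esac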

>> To index PDF files, I implemented the following workaround:
>> 1. For every PDF file (for example, "myfile.pdf"), create a file
>> "myfile.pdf.html" that contains the plain text to be indexed.
>> [...]
>That's too complicated for me to handle in practice. ;-)
>The filter progs have to convert the contents of a file (pdf, word, xls)
>to standard text and print it on STDOUT.

I have a lot of large PDF files to index, and pdftotext seems to be a
bit slow.  I would hate to waste processor time converting the PDFs to
text every time I want to update my search index.

So I created a script that searches my directories for PDF files, then
extracts the text into a .pdf.txt file (only if the .pdf.txt file does not
exist, or is older than the .pdf file).  Thus I only have to extract the
text once, instead of every time I create the search index.
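
In case it helps anyone, the heart of the script is roughly this.  A
sketch, not the script verbatim: /docs stands in for my directories,
and it assumes file names without whitespace and a shell whose test
supports the -nt comparison.

  #!/bin/sh
  # Walk the tree and keep a .pdf.txt sibling up to date for every PDF.
  find /docs -name '*.pdf' | while read pdf
  do
      txt="$pdf.txt"
      # Extract only if the text copy is missing or older than the
      # PDF ("-nt" is the "newer than" file test).
      if [ ! -f "$txt" ] || [ "$pdf" -nt "$txt" ]; then
          pdftotext "$pdf" "$txt" 2>/dev/null
      fi
  done

The indexer then picks up the .pdf.txt files on its normal run.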

Patrick Fitzgerald, HP Internet and System Security Lab
Received on Tue Aug 11 16:39:57 1998