
Re: indexing PDF

From: Patrick Fitzgerald <fitz(at)not-real.issl.atl.hp.com>
Date: Wed Aug 12 1998 - 00:28:15 GMT
Rainer Scherg RTC wrote:
>> Could you describe the code changes?
>
>Starting a filter program as a child process. The output of the filter
>program is piped to the swish-e process. [...]
>
>The _pdf_filter.sh program is very simple:
>
>#!/bin/sh
>pdftotext "$1" - 2>/dev/null
>
>... using the xpdf utility (pdftotext).

Thanks for the pointer, I didn't know such a beast existed.


>> To index PDF files, I implemented the following workaround:
>> 
>> 1. For every PDF file (for example, "myfile.pdf"), create a file
>> "myfile.pdf.html" that contains the plain text to be indexed.
>> [...]
>
>That is too complicated for me to handle in practice. ;-)
>The filter programs have to convert the contents of a file (pdf, word, xls)
>to plain text and print it on STDOUT.

I have a lot of large PDF files to index, and pdftotext seems to be a
bit slow.  I would hate to waste processor time converting the PDFs to
text every time I update my search index.

So I created a script that searches my directories for PDF files, then
extracts the text into a .pdf.txt file (only if the .pdf.txt file does not
exist, or is older than the .pdf file).  Thus I only have to extract the
text once, instead of every time I create the search index.
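
Roughly, the script boils down to something like this (a minimal sketch:
/docs stands in for my document root, and it assumes a shell whose test
command supports the -nt comparison):

  #!/bin/sh
  # Refresh the .pdf.txt extract for each PDF under /docs, but only
  # when the extract is missing or older than the PDF itself.
  # (Filenames containing whitespace are not handled here.)
  find /docs -name '*.pdf' -print |
  while read pdf
  do
      txt="$pdf.txt"
      if [ ! -f "$txt" ] || [ "$pdf" -nt "$txt" ]
      then
          pdftotext "$pdf" "$txt" 2>/dev/null
      fi
  done

The indexing run then only ever sees the cached .pdf.txt files, so the
expensive extraction happens at most once per changed PDF.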

-- 
Patrick Fitzgerald, HP Internet and System Security Lab
http://issl.atl.hp.com/lab/employees/fitz/
fitz@issl.atl.hp.com  -or-  patrick_fitzgerald@hp.com

(do *not* use pat_fitzgerald@hp.com, that is not me)