Rainer Scherg RTC wrote:
>> Could you describe the code changes?
>Starting a filter program as child process. The output of the filter prog
>will be piped to the swish-e process. [...]
>The _pdf_filter.sh - prog is very simple:
>pdftotext "$1" - 2>/dev/null
>... using the xpdf utility (pdftotext).
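For readers following along, here is a slightly more defensive take on the one-line filter quoted above. It is only a sketch: the function name `pdf_filter` and the `PDFTOTEXT` override variable are illustrative assumptions, not part of the original script or of swish-e.

```shell
#!/bin/sh
# Sketch of a filter in the style of _pdf_filter.sh.
# PDFTOTEXT is a hypothetical override; it defaults to xpdf's pdftotext.
PDFTOTEXT=${PDFTOTEXT:-pdftotext}

pdf_filter() {
    # $1 = path of the PDF file handed over by the indexer
    if [ ! -r "$1" ]; then
        echo "pdf_filter: cannot read '$1'" >&2
        return 1
    fi
    # "-" as the output file makes pdftotext write the text to STDOUT;
    # its diagnostics are discarded so only text reaches the pipe.
    "$PDFTOTEXT" "$1" - 2>/dev/null
}
```

Calling `pdf_filter myfile.pdf` should then print the extracted text on STDOUT, which is what the indexing process reads from the pipe.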
Thanks for the pointer, I didn't know such a beast existed.
>> To index PDF files, I implemented the following workaround:
>> 1. For every PDF file (for example, "myfile.pdf"), create a file
>> "myfile.pdf.html" that contains the plain text to be indexed.
>That's too complicated for me to handle in practice. ;-)
>The filter progs have to convert the contents of a file (pdf, word, xls)
>to plain text and print it on STDOUT.
I have a lot of large PDF files to be indexed, and pdftotext seems to be a
bit slow. I would hate to waste processor time converting the PDF to text
every time I want to update my search index.
So I created a script that searches my directories for PDF files, then
extracts the text into a .pdf.txt file (only if the .pdf.txt file does not
exist, or is older than the .pdf file). Thus I only have to extract the
text once, instead of every time I create the search index.
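A rough sketch of that kind of script follows. It is not the actual script, just an illustration of the idea: the function name `update_pdf_text` and the `EXTRACT` variable are made up for this example, pdftotext is assumed as the extractor, and the `-nt` file test is assumed to be available in your shell (it is in bash, dash and ksh). File names containing newlines are not handled.

```shell
#!/bin/sh
# Hypothetical sketch: refresh the cached text for every PDF under a directory.
# EXTRACT is an illustrative override; pdftotext (from xpdf) is assumed.
EXTRACT=${EXTRACT:-pdftotext}

update_pdf_text() {
    # $1 = directory to scan (defaults to the current directory)
    find "${1:-.}" -type f -name '*.pdf' | while IFS= read -r pdf; do
        txt="$pdf.txt"
        # Extract only if the .pdf.txt file is missing or older than the PDF.
        if [ ! -f "$txt" ] || [ "$pdf" -nt "$txt" ]; then
            echo "extracting: $pdf"
            "$EXTRACT" "$pdf" "$txt"
        fi
    done
}
```

Running `update_pdf_text /path/to/docs` before each indexing pass should then touch only the PDFs that are new or have changed since the last run.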
Patrick Fitzgerald, HP Internet and System Security Lab
firstname.lastname@example.org -or- email@example.com
(do *not* use firstname.lastname@example.org, that is not me)
Received on Tue Aug 11 16:39:57 1998