Re: indexing PDF

From: Patrick Fitzgerald <fitz(at)>
Date: Mon Aug 10 1998 - 18:16:15 GMT
Rainer Scherg RTC wrote:
>I've made some enhancements to swish-e 1.1 to index Non-Text or HTML files
>(e.g. to get PDF-files indexed) [I've sent the code changes to Roy].

Could you describe the code changes?  Do you directly index the PDF files?

To index PDF files, I implemented the following workaround:

1. For every PDF file (for example, "myfile.pdf"), create a file
"myfile.pdf.html" that contains the plain text to be indexed.

2. When the search engine returns a hit on myfile.pdf.html, change the
reference to point to myfile.pdf instead.

This also works for other file types, such as Word files.  The only
disadvantage is that you must create the separate HTML files.
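
The two steps above can be sketched roughly as follows, in Python. This is
only an illustration of the workaround, not code from swish-e or from my
setup: the function names, the SOURCE_EXTENSIONS tuple, and the HTML wrapper
are all assumptions, and the plain text is taken as given (extracting it
from the PDF is a separate job for whatever converter you have).

```python
import html
from pathlib import Path

# Hypothetical list of non-HTML file types that get a sidecar file.
SOURCE_EXTENSIONS = (".pdf", ".doc")


def write_sidecar(original, text):
    """Step 1: write "<name>.html" next to the original file.

    `text` is the plain text to be indexed; how it was extracted
    from the original file is outside the scope of this sketch.
    """
    original = Path(original)
    sidecar = original.with_name(original.name + ".html")
    title = html.escape(original.name)
    body = html.escape(text)  # escape <, >, & so the text is valid HTML
    sidecar.write_text(
        "<html><head><title>%s</title></head>\n"
        "<body><pre>%s</pre></body></html>\n" % (title, body),
        encoding="utf-8",
    )
    return sidecar


def original_from_hit(hit):
    """Step 2: map a search hit on a sidecar back to the original file.

    Hits on ordinary HTML pages are returned unchanged.
    """
    if hit.lower().endswith(".html"):
        stripped = hit[: -len(".html")]
        if stripped.lower().endswith(SOURCE_EXTENSIONS):
            return stripped
    return hit
```

For example, original_from_hit("docs/myfile.pdf.html") gives
"docs/myfile.pdf", while a hit on "docs/index.html" is left alone.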

Patrick Fitzgerald, HP Internet and System Security Lab  -or-

(do *not* use, that is not me)
Received on Mon Aug 10 10:27:21 1998