Re: indexing PDF

From: Patrick Fitzgerald <fitz(at)>
Date: Mon Aug 10 1998 - 18:16:15 GMT

Rainer Scherg RTC wrote:
>I've made some enhancements to swish-e 1.1 to index Non-Text or HTML files
>(e.g. to get PDF-files indexed) [I've sent the code changes to Roy].

Could you describe the code changes?  Do you directly index the PDF files?

To index PDF files, I implemented the following workaround:

1. For every PDF file (for example, "myfile.pdf"), create a file
"myfile.pdf.html" that contains the plain text to be indexed.

2. When the search engine returns a hit on myfile.pdf.html, rewrite the
reference so that it points to myfile.pdf instead.
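
The two steps above could be sketched roughly as follows. This is only an
illustration in Python, not the code I actually run: the function names are
made up for this example, and extract_text stands in for whatever external
converter you use to get plain text out of a PDF (it is an assumption, not
part of swish-e).

```python
import html
from pathlib import Path
from typing import Callable, Iterable

# Suffix appended to the original file name, e.g. myfile.pdf -> myfile.pdf.html
SIDECAR_SUFFIX = ".html"


def write_sidecar(source: Path, extract_text: Callable[[Path], str]) -> Path:
    """Step 1: write myfile.pdf.html next to myfile.pdf.

    extract_text is a hypothetical callable that returns the plain text
    of the document; plug in your own converter here.
    """
    sidecar = source.with_name(source.name + SIDECAR_SUFFIX)
    body = html.escape(extract_text(source))   # keep '<' and '&' from breaking the HTML
    title = html.escape(source.name)
    sidecar.write_text(
        f"<html><head><title>{title}</title></head>"
        f"<body><pre>{body}</pre></body></html>\n",
        encoding="utf-8",
    )
    return sidecar


def original_reference(hit: str, extensions: Iterable[str] = (".pdf",)) -> str:
    """Step 2: map a hit on myfile.pdf.html back to myfile.pdf.

    Hits on ordinary HTML files (no known extension underneath) are
    returned unchanged.
    """
    if hit.endswith(SIDECAR_SUFFIX):
        candidate = hit[: -len(SIDECAR_SUFFIX)]
        if candidate.lower().endswith(tuple(extensions)):
            return candidate
    return hit
```

With this sketch, original_reference("docs/myfile.pdf.html") gives
"docs/myfile.pdf", while a hit on "docs/index.html" is left alone. Passing
extra extensions, for example (".pdf", ".doc"), would cover other file types
in the same way.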

The same approach works for other file types, such as Word files. The only
disadvantage is that you must create the separate HTML files.

