Re: indexing PDF

From: Patrick Fitzgerald <fitz(at)not-real.issl.atl.hp.com>
Date: Mon Aug 10 1998 - 18:16:15 GMT
Rainer Scherg RTC wrote:
>
>I've made some enhancements to swish-e 1.1 to index Non-Text or HTML files
>(e.g. to get PDF-files indexed) [I've sent the code changes to Roy].

Could you describe the code changes?  Do you directly index the PDF files?

To index PDF files, I implemented the following workaround:

1. For every PDF file (for example, "myfile.pdf"), create a file
"myfile.pdf.html" that contains the plain text to be indexed.

2. When the search engine returns a hit on myfile.pdf.html, change the
reference so that it points to myfile.pdf.

This also works for other file types, such as Word documents.  The only
disadvantage is that you must create the separate HTML files.
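
For illustration, the two steps could be sketched roughly as below.  This is
not the code I actually use; it is a minimal Python sketch under some
assumptions: the plain text has already been extracted from the PDF by some
external tool (not shown), and the function names write_sidecar and
hit_to_original, as well as the list of extensions, are hypothetical.

```python
import html
from pathlib import Path

# Assumed naming convention from the steps above: myfile.pdf -> myfile.pdf.html
SIDECAR_SUFFIX = ".html"


def write_sidecar(original: Path, text: str) -> Path:
    """Step 1: write the already-extracted plain text next to the original file."""
    sidecar = original.with_name(original.name + SIDECAR_SUFFIX)
    # Escape the text so stray '<' or '&' characters do not confuse the indexer.
    sidecar.write_text(
        f"<html><head><title>{html.escape(original.name)}</title></head>"
        f"<body><pre>{html.escape(text)}</pre></body></html>\n",
        encoding="utf-8",
    )
    return sidecar


def hit_to_original(hit_path: str, extensions=(".pdf", ".doc")) -> str:
    """Step 2: map a hit on 'myfile.pdf.html' back to 'myfile.pdf'.

    Hits on ordinary HTML files are returned unchanged.
    """
    if hit_path.endswith(SIDECAR_SUFFIX):
        candidate = hit_path[: -len(SIDECAR_SUFFIX)]
        if candidate.lower().endswith(extensions):
            return candidate
    return hit_path
```

The second function would run over the search results before they are shown
to the user, so that the links lead to the original documents rather than to
the generated HTML files.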

-- 
Patrick Fitzgerald, HP Internet and System Security Lab
http://issl.atl.hp.com/lab/employees/fitz/
fitz@issl.atl.hp.com  -or-  patrick_fitzgerald@hp.com

(do *not* use pat_fitzgerald@hp.com, that is not me)
Received on Mon Aug 10 10:27:21 1998