>Rainer Scherg RTC wrote:
>>I've made some enhancements to swish-e 1.1 to index Non-Text or HTML files
>>(e.g. to get PDF-files indexed) [I've sent the code changes to Roy].
>Could you describe the code changes? Do you directly index the PDF files?
>To index PDF files, I implemented the following workaround:
>1. For every PDF file (for example, "myfile.pdf"), create a file
>"myfile.pdf.html" that contains the plain text to be indexed.
>2. When the search engine returns a hit on myfile.pdf.html, change the
>reference to myfile.pdf.
>This works for other filetypes, such as Word files, etc. The only
>disadvantage is that you must create the separate HTML files.
We do a similar thing. We've got several manuals in PDF format. We use
Acrobat to spit out a text version of each PDF file, which we put in a
different directory. For example:
/manual/chapter1.pdf <-- real pdf
/manual/txt/chapter1.pdf <-- text equivalent
Then we use a "ReplaceRules" directive in swish.conf to replace /manual/txt
with /manual in the indexed paths. By giving the text file the same name as
the PDF, we don't have to deal with any weird machinations later on, and it
works with any search interface we want to use.
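
As a rough illustration of the path mapping described above, here is a small Python sketch. It is not swish-e code and not our actual setup; the function names and constants are made up for the example, and it only mimics the effect of replacing /manual/txt with /manual, assuming the text equivalents keep the PDF's file name.

```python
from pathlib import PurePosixPath

# Directory names taken from the example above; everything else is hypothetical.
TEXT_ROOT = "/manual/txt"   # where the text equivalents are indexed
REAL_ROOT = "/manual"       # where the real PDF files live


def to_text_path(pdf_path: str) -> str:
    """Return where the text equivalent of a PDF would be stored (a txt/ subdirectory)."""
    p = PurePosixPath(pdf_path)
    return str(p.parent / "txt" / p.name)


def to_real_path(indexed_path: str) -> str:
    """Map an indexed text file's path back to the PDF it stands in for.

    Paths outside TEXT_ROOT are returned unchanged.
    """
    prefix = TEXT_ROOT + "/"
    if indexed_path.startswith(prefix):
        return REAL_ROOT + "/" + indexed_path[len(prefix):]
    return indexed_path


if __name__ == "__main__":
    print(to_text_path("/manual/chapter1.pdf"))
    print(to_real_path("/manual/txt/chapter1.pdf"))
```

Because the rewrite happens once, in the path stored at index time, the search front end never needs to know that the text copies exist.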
Rick Beebe (203) 785-4566
Network Engineering Manager FAX: (203) 737-4037
ITS-Med Technology Operations Richard.Beebe@yale.edu
Yale University School of Medicine
P.O. Box 208089, New Haven, CT 06520-8089
Received on Mon Aug 10 11:18:09 1998