>Rainer Scherg RTC wrote:
>>
>>I've made some enhancements to swish-e 1.1 to index Non-Text or HTML files
>>(e.g. to get PDF-files indexed) [I've sent the code changes to Roy].
>Could you describe the code changes? Do you directly index the PDF files?
>To index PDF files, I implemented the following workaround:
>1. For every PDF file (for example, "myfile.pdf"), create a file
>"myfile.pdf.html" that contains the plain text to be indexed.
>2. When the search engine returns a hit on a myfile.pdf.html, change the
>reference to myfile.pdf.
>This works for other filetypes, such as Word files, etc. The only
>disadvantage is that you must create the separate HTML files.
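
The naming convention in the quoted workaround can be sketched in a few lines of Python. This is only an illustration of the idea, not code from swish-e or from either poster; the function names, the `.html` suffix constant, and the list of file types are assumptions made for the example.

```python
# Hypothetical sketch of the "companion file" convention described above.
# Nothing here is part of swish-e; all names are made up for illustration.

COMPANION_SUFFIX = ".html"          # companion is "<original name>.html"
INDEXED_TYPES = (".pdf", ".doc")    # assumed list of non-HTML types

def companion_path(original: str) -> str:
    """Return the name of the text-only companion file to be indexed."""
    return original + COMPANION_SUFFIX

def original_path(hit: str) -> str:
    """Map a search hit on a companion file back to the real document.

    Ordinary HTML pages (for example "index.html") are returned unchanged.
    """
    lowered = hit.lower()
    for ext in INDEXED_TYPES:
        if lowered.endswith(ext + COMPANION_SUFFIX):
            # Drop only the trailing ".html", keeping "myfile.pdf".
            return hit[: -len(COMPANION_SUFFIX)]
    return hit
```

A search front end would call something like `original_path()` on each hit before printing the link, so the user lands on the PDF rather than on the text-only copy.
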
We do a similar thing. We've got several manuals in PDF format. We use
Acrobat to spit out a text version of each PDF file, which we put in a
different directory. For example:
/manual/chapter1.pdf <-- real pdf
/manual/txt/chapter1.pdf <-- text equivalent
Then we use a ReplaceRules directive in swish.conf to replace /manual/txt
with /manual in the indexed paths. Because the text file has the same name
as the PDF, we don't have to rewrite anything later on, and it works with
any search interface we want to use.
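
For anyone who wants to see the effect of that path rewrite outside of swish-e, here is a rough Python sketch. It only imitates what the replacement does to a path; it is not how swish-e implements ReplaceRules, and the function names and the `txt` subdirectory default are assumptions taken from the example layout above. Check the swish-e documentation for the exact directive syntax.

```python
import posixpath

# Hypothetical helpers mirroring the /manual vs. /manual/txt layout above.

def text_copy_path(pdf_path: str, subdir: str = "txt") -> str:
    """Where the text equivalent of a PDF lives: same name, in a subdirectory."""
    directory, name = posixpath.split(pdf_path)
    return posixpath.join(directory, subdir, name)

def rewrite_indexed_path(path: str,
                         old: str = "/manual/txt",
                         new: str = "/manual") -> str:
    """Imitate the replacement: swap the text directory for the real one."""
    # Replace only the first occurrence so the rest of the path is untouched.
    return path.replace(old, new, 1)
```

With these, `/manual/chapter1.pdf` maps to `/manual/txt/chapter1.pdf` for indexing, and the rewrite turns that path back into `/manual/chapter1.pdf` in the results.
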
_______________________________________________________________________
Rick Beebe (203) 785-4566
Network Engineering Manager FAX: (203) 737-4037
ITS-Med Technology Operations Richard.Beebe@yale.edu
Yale University School of Medicine
P.O. Box 208089, New Haven, CT 06520-8089
_______________________________________________________________________
Received on Mon Aug 10 11:18:09 1998