Re: [swish-e] HTML Parser chokes when indexing image pdf's

From: Dr Michael Daly
Date: Thu, 15 Mar 2012 13:14:06 +1100 (EST)
The funny thing is that *no* FileFilter options are specified in my config file:

 IndexOnly .htm .html .txt .doc .pdf .xls
 IndexContents TXT* .txt
 DefaultContents HTML*

I can see both /opt/bin/catdoc and /opt/bin/pdftotext, and /opt/bin is
in $PATH, so I presume there must be some hard-coding within swish-e
that picks them up without a FileFilter directive being configured.

Should these directives be added?
FileFilter  .pdf    pdf2html
FileFilter .pdf     pdftotext   "'%p' -"
FileFilter .doc     /opt/bin/catdoc "-s8859-1 -d8859-1 %p"

If not, can the parsing errors be ignored?
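
For reference, a minimal sketch of how such filters might be combined with the existing directives. It assumes pdftotext and catdoc are installed in /opt/bin as described above, and that the filtered output should go to the plain-text parser rather than the HTML parser; the paths, options, and extension list are illustrative and not a confirmed fix for the warnings.

```
# Sketch only: paths and options are assumptions, adjust for your system.
# Convert PDFs to plain text on stdout; %p is the path of the file being indexed.
FileFilter .pdf /opt/bin/pdftotext "'%p' -"
# Convert Word documents to text with catdoc.
FileFilter .doc /opt/bin/catdoc "-s8859-1 -d8859-1 '%p'"
# Filter output is plain text, so hand these extensions to the TXT parser
# instead of letting DefaultContents HTML* apply to them.
IndexContents TXT* .txt .pdf .doc
```

Note that a scanned, image-only PDF has no text layer, so pdftotext would be expected to produce little or no output for it even with the filter in place.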


Dr Michael Daly wrote on 3/14/12 6:26 AM:
> Here is the contents of the config file:
>  IndexDir /share/MD0_DATA/server_dir/Correspondence/2011_Correspondence
>  IndexOnly .htm .html .txt .doc .pdf .xls
>  IndexContents TXT* .txt
>  DefaultContents HTML*
>  ParserWarnLevel 9
>  #(as I said ParserWarnLevel 1 abolishes the warnings)
>  IndexFile /share/MD0_DATA/swish-e-files/swish-e-index/swish_1.index
> The command invocations:
> 1. To index:
>    swish-e -c /share/MD0_DATA/swish-e-config/swish_1.conf
> 2. To search the .index file:
>    swish-e -f /share/MD0_DATA/swish-e-index/swish_1.index -w employee

make sure you've read this:

and then post back with any questions.

Peter Karman  .  .  peter(at)
Users mailing list

Received on Thu Mar 15 2012 - 02:24:12 GMT