Re: [swish-e] swish-e - Help with indexing pdf´s

From: David Brown <dave(at)>
Date: Mon Jun 21 2010 - 23:19:25 GMT
Just in case it might be helpful to you, here are some filter settings I've been using successfully for a number of years:
#use FileFilters to process other than HTML
FileFilter .pdf "/usr/local/bin/pdftohtml" "-q -i -stdout -noframes %P"
FileFilter .doc "/usr/local/bin/abiword" "-t html -o fd://1 %P"
This is on FreeBSD; pdftohtml is derived from xpdf. (I'm not sure why mine reports version 0.39 while the web page says 0.36 is the latest.)
(BTW, trying to get abiword installed with no GUI was quite tedious and required lots of graphics-oriented dependencies that are now just wasting space on the system.)
BTW, lots of nulls sounds like maybe you're getting the still-compressed data stream from the PDFs and the HTML conversion isn't happening. Try running your filter from the command line and see if you get HTML or gibberish/error. Also, make sure that you replace path-to-swish-e with your actual path to swish-e; on my system, this is /usr/local/bin/swish-e
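If it helps, the command-line check can be scripted. The following is only a minimal sketch, not anything that ships with swish-e: the function names and the script name in the usage line are made up for illustration, and it assumes that healthy converter output starts with a doctype or an <html> tag, which may not hold for every pdftohtml build. It runs a filter command, counts the null bytes in what the filter wrote to stdout, and reports whether that output looks like HTML or still like raw PDF.

```python
import subprocess
import sys
from typing import Sequence


def inspect_filter_output(data: bytes) -> dict:
    """Summarise what a filter wrote to stdout (hypothetical helper)."""
    stripped = data.lstrip()
    head = stripped[:1024].lower()
    return {
        # swish-e warns about embedded nulls; count them directly.
        "null_bytes": data.count(b"\x00"),
        # Assumption: converted output begins with a doctype or <html> tag.
        "looks_like_html": head.startswith((b"<!doctype", b"<html")),
        # Raw PDF files begin with the "%PDF" magic bytes.
        "still_pdf": stripped.startswith(b"%PDF"),
    }


def run_filter(command: Sequence[str]) -> dict:
    """Run the filter command and inspect its stdout (hypothetical helper)."""
    result = subprocess.run(list(command), capture_output=True, check=False)
    report = inspect_filter_output(result.stdout)
    report["exit_code"] = result.returncode
    return report


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is treated as the filter command line.
    print(run_filter(sys.argv[1:]))
```

Saved under a name of your choice (check_filter.py here is just a placeholder), it could be run as: python check_filter.py /usr/local/bin/pdftohtml -q -i -stdout -noframes /path/to/Document1.pdf. A high null count together with "still_pdf" being true would suggest the conversion is not happening at all.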
Dave Brown
From: [] On Behalf Of
Sent: Monday, June 21, 2010 9:01 AM
To: Swish-e Users Discussion List
Subject: [swish-e] swish-e - Help with indexing pdf´s

Hi @ All,

I have a short question again:

When I want to index PDF files, must xpdf be installed on the server where I run the indexing, or on the server where I run the search, i.e. the server where swish.cgi is called?

When I start the indexing, I get errors like this:

 - Using HTML parser -  (98779 words)
Warning: Substituted 2397 embedded null character(s) in file '/Document1.pdf

and so on, and I don't know why.
In my swish.conf I wrote:

IndexOnly .htm .html .php .doc .xml .pdf
FileFilter /path-to-swishe/filter-bin\ "%p -" /\.pdf$/

and there are no PDFs in my search results.
Do I need to add anything else to the conf file?
Does anybody have an idea?

