Just in case it might be helpful to you, here are some filter settings I've been using successfully for a number of years:
#use FileFilters to process other than HTML
FileFilter .pdf "/usr/local/bin/pdftohtml" "-q -i -stdout -noframes %P"
FileFilter .doc "/usr/local/bin/abiword" "-t html -o fd://1 %P"
This is on FreeBSD; pdftohtml is from http://pdftohtml.sourceforge.net/ and is derived from xpdf. (I'm not sure why mine reports version 0.39 while the web page says 0.36 is the latest.)
(BTW, trying to get abiword installed with no GUI was quite tedious and required lots of graphics-oriented dependencies that are now just wasting space on the system.)
As for the warnings: lots of nulls sounds like you're getting the still-compressed data stream from the PDFs, i.e. the HTML conversion isn't happening. Try running your filter from the command line and see whether you get HTML or gibberish/an error. Also, make sure that you replace path-to-swish-e with your actual path to swish-e; on my system, this is /usr/local/bin/swish-e
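If you'd rather script that command-line check than eyeball the output, here is a minimal sketch in Python (my choice of language for illustration, nothing swish-e requires). The function names looks_like_html, run_filter and check_filter are made up for this example, and the test itself is only a heuristic: it treats output as HTML if it contains no null bytes and has an HTML-looking tag near the start.

```python
import subprocess


def looks_like_html(data):
    """Heuristic only: no NUL bytes, and an HTML-ish tag in the first 4 KB."""
    if b"\x00" in data:
        # Embedded nulls suggest raw or still-compressed PDF data.
        return False
    head = data[:4096].lower()
    return b"<html" in head or b"<!doctype html" in head


def run_filter(command):
    """Run the filter command (a list of arguments) and return its stdout bytes."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=False,  # a failing filter is something we want to inspect, not crash on
    )
    return result.stdout


def check_filter(command):
    """Return a short verdict on what the filter wrote to stdout."""
    output = run_filter(command)
    if not output:
        return "no output (did the filter run at all?)"
    if looks_like_html(output):
        return "looks like HTML"
    return "not HTML: %d null byte(s) in output" % output.count(b"\x00")


# Example call, using the pdftohtml options from the FileFilter line above.
# The PDF path is a placeholder; substitute one of your own documents.
# print(check_filter(["/usr/local/bin/pdftohtml", "-q", "-i", "-stdout",
#                     "-noframes", "/path/to/sample.pdf"]))
```

If the verdict is anything other than "looks like HTML", the problem is in the filter command itself rather than in how swish-e calls it.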
--
Dave Brown
dave@davidhbrown.us
From: users-bounces@lists.swish-e.org [mailto:users-bounces@lists.swish-e.org] On Behalf Of pgeo@gmx.de
Sent: Monday, June 21, 2010 9:01 AM
To: Swish-e Users Discussion List
Subject: [swish-e] swish-e - Help with indexing pdf´s
Hi all,
I have a short question again.
First:
When I want to index PDF files, must the program xpdf be installed on the server where I start the indexing, or on the server where I run the search (that is, the server where I call swish.cgi)?
Second:
When I start the indexing I get errors like this:
- Using HTML parser - (98779 words)
Document.pdf
Warning: Substituted 2397 embedded null character(s) in file '/Document1.pdf
and so on ... and I don't know why.
In my swish.conf I wrote:
...
IndexOnly .htm .html .php .doc .xml .pdf
FileFilter /path-to-swishe/filter-bin\_pdf2html.pl "%p -" /\.pdf$/
...
and there are no PDFs in my search results.
Do I have to add anything more to the conf file?
Does anybody have an idea?
Regards
Peter
Received on Mon Jun 21 19:19:43 2010