I've been poking around trying to figure out how to get PDF indexing to
work, and haven't had any luck. I'm running into the same problem that
was discussed in this thread: null characters in the PDF files get
replaced with line feed characters, and the PDF is later mistaken for
HTML (note the "Using HTML2 parser" line in the errors below). Has this
problem been fixed?
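To illustrate the failure mode, here is a small sketch of my own in Python (not Swish-e's actual code): PDF stream objects are binary and legitimately contain NUL bytes, so a byte-for-byte NUL-to-linefeed substitution corrupts the stream in place even though file offsets stay valid.

```python
import zlib

# Stand-in for a PDF stream object: zlib level 0 stores the data
# literally, so the compressed bytes are guaranteed to contain NULs.
stream = zlib.compress(b"\x00" * 100, 0)

# What the spider appears to do to the content in transit:
mangled = stream.replace(b"\x00", b"\n")

assert len(mangled) == len(stream)  # 1-for-1 swap, offsets unaffected
try:
    zlib.decompress(mangled)
    print("stream intact")
except zlib.error:
    print("stream corrupted")  # -> stream corrupted
```

The length check passes, which is why the xref offsets still point at plausible places, but the stream contents no longer decode, matching the "PDF file is damaged" errors from xpdf below.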
The PDFs convert fine when I run _pdf2html.pl on them from the command
line, but fail when converted via the spider.
I am running on Windows 2000; here is my configuration:
SwishProgParameters default http://L0053022/index.htm
IndexOnly .htm .html .pdf
StoreDescription HTML* <body> 10000
MetaNames description keywords
PropertyNames description keywords
FileFilter .pdf _pdf2html.pl '"%p" -'
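One variant I would try if the filter isn't being launched correctly (an untested guess on my part; the path and quoting here are assumptions, following FileFilter's program-plus-options form):

```
# Hypothetical alternative: invoke the interpreter explicitly, since
# Windows ignores the #! line when the filter program is spawned.
FileFilter .pdf perl 'C:\SWISH-E\lib\swish-e\_pdf2html.pl "%p" -'
```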
...and here are the errors I get when doing the PDF conversion via the
spider:
http://l0053022/pdf/shareholder/editable/afd-103_aflink.pdf - Using HTML2 parser -
Error: May not be a PDF file (continuing anyway)
Error (0): PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
C:\SWISH-E\lib\swish-e\_pdf2html.pl: Failed close on pipe to pdfinfo for
C:\TEMP\swtmpfltrcnaaaa: 256 at C:\SWISH-E\lib\swish-e\_pdf2html.pl line
(no words indexed)
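As a quick check of whether the spider rewrites the bytes before the filter sees them, one could compare the temp file against the original. This is a hypothetical helper of my own (not part of Swish-e or _pdf2html.pl):

```python
def looks_like_pdf(data: bytes) -> bool:
    # Every PDF begins with a "%PDF-" version header.
    return data.startswith(b"%PDF-")

def nul_count(data: bytes) -> int:
    # Real PDFs almost always contain NUL bytes in their binary streams;
    # a count of zero suggests the content was rewritten in transit.
    return data.count(b"\x00")

# Fabricated byte string standing in for a fetched file:
sample = b"%PDF-1.4\n<< /Length 3 >>\nstream\n\x00\x01\x00\nendstream"
print(looks_like_pdf(sample), nul_count(sample))  # -> True 2
```

If the temp copy that swish-e hands to the filter reports zero NULs while the file on disk reports many, that would confirm the substitution happens in the spider rather than in the filter.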
I saw SWISH::Filter mentioned as an alternative, but so far I've avoided
it since I'm a Perl dolt and it looked like less of a turnkey solution.
Thanks in advance - Brad
Received on Tue Dec 16 22:16:56 2003