This is a resend. I'm curious whether anyone here is using swish-e in a Windows
environment and has seen problems indexing the content of PDFs, as described
below. If more information about my setup is needed, please let me know.
I've been struggling with using swish-e on a Windows 2000 server. I'm
spidering the target site, and when I hit a PDF file with "errors" (Missing
'endstream') the spider can lock up.
I've replaced the pdftotext program with the latest version (v3 1/22/2004)
and tested it on the problematic PDFs. It throws the same errors but does
create a "text" file containing all the text along with some garbage
characters. It appears that swish-e is either waiting for an exit code that
never comes from pdftotext or cannot handle the output with garbage characters.
Has anyone else seen this?
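To tell those two possibilities apart outside of swish-e, a rough Python sketch like the one below might help. It is not part of swish-e or its spider. The helper names `run_with_timeout` and `strip_control_chars` are made up for illustration, and it assumes Python 3.7+ with `pdftotext` on the PATH (invoked as `pdftotext file.pdf -` so the text goes to stdout). It runs the converter under a time limit, reports whether it exited or hung, and strips control characters from whatever text came back.

```python
import re
import subprocess
import sys


def run_with_timeout(cmd, timeout_s=30):
    """Run cmd and return (status, exit_code, stdout_bytes, stderr_bytes).

    status is "exited" if the process finished on its own, or "timeout"
    if it was still running after timeout_s seconds and had to be killed.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired as exc:
        # No exit code is available; keep any partial output that was captured.
        return ("timeout", None, exc.stdout or b"", exc.stderr or b"")
    return ("exited", proc.returncode, proc.stdout, proc.stderr)


def strip_control_chars(data):
    """Remove control bytes (including form feeds) but keep tab, LF and CR."""
    return re.sub(rb"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", b"", data)


if __name__ == "__main__" and len(sys.argv) > 1:
    pdf_path = sys.argv[1]  # path to one of the problematic PDFs
    status, code, out, err = run_with_timeout(["pdftotext", pdf_path, "-"])
    print("status:", status, "exit code:", code)
    print("stderr:", err.decode("latin-1", errors="replace").strip())
    cleaned = strip_control_chars(out)
    print("bytes of text:", len(out), "after cleaning:", len(cleaned))
```

If the status comes back as a timeout, the converter itself is hanging on that file. If it exits (even with a nonzero code) and still produces text, the lockup is more likely in how the output is being read or filtered afterwards.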
Here's some config info, if necessary:
Swish-e v2.4.2 for Windows
batch file for spidering (wrapped for reading)
-S prog -v 3 -c
ReplaceRules remove http://www.site.com
IndexContents HTML* .asp .htm .html .pdf
StoreDescription HTML* <body> 320
Received on Fri Aug 6 15:14:29 2004