I've spent several frustrating hours debugging an index job that uses spider.pl, and having found the solution, I thought I'd share it to save others the trouble. I have a site of about 1,000 links, mostly HTML and PDF files. I used the built-in spider.conf and the filter as recommended in the docs. (swish-e 2.2.3, RedHat 8.0 - 2.4.18.) It worked wonderfully on the development server, then failed on the new production server (of course). The spider process failed on several of the PDF files, with the message "err: External program failed to return required headers Path-Name: & Content-Length:".
I took one of the offending PDFs and ran it through pdf2html.pm on its own. That failed too, on a "tr / ..." statement at line 201. After much hunting I discovered that the LANG environment variable on the production server was "en_US.UTF-8", while on the dev server it was simply "en_US". When I removed the ".UTF-8" suffix from LANG on the production box, it worked great! So it appears that the transliteration in pdf2html.pm does not cope with a UTF-8 locale; at least, that's my uneducated guess.
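For anyone who would rather not change LANG system-wide, here is a rough sketch of overriding it for a single indexing run. The strip_utf8 helper is a name I made up for illustration, and the swish-e command line in the comment is an assumption about how the job is started, so adjust it to match your own setup. Note also that if LC_ALL or LC_CTYPE is set, it takes precedence over LANG and would need the same treatment.

```shell
#!/bin/sh
# Hypothetical helper (not part of swish-e): drop a trailing ".UTF-8"
# charset suffix from a locale name, leaving other values untouched.
strip_utf8() {
    printf '%s\n' "${1%.UTF-8}"
}

# Show the effect on the value from the production server.
strip_utf8 "en_US.UTF-8"

# To apply it to one indexing run only (assumed invocation, not run here):
#   LANG=$(strip_utf8 "$LANG") swish-e -c spider.conf -S prog
```

Setting the variable on the command line like this affects only that one process and its children, so the rest of the system keeps its UTF-8 locale.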
Anyway, I'm good to go now.
Received on Sat Jan 25 00:43:41 2003