--- Bill Moseley <moseley@hank.org> wrote:
> I sure wish pdftotext printed the file name with its errors.
Agreed... ;-(
> So when you pass the file directly to pdftotext and pdfinfo it works
> fine?
Yes.
> Looks like you are passing an invalid file to pdftotext. Have your
> filter write it to disk and compare with the source file.
> Try indexing a single file so you can be sure of what file is
> generating the errors.
Bill,
Thanks for the list of steps. It appears to be a bug in swish-e, but I'm
open to other interpretations.
Here's what I have:
1. I enhanced spider.pl to write $$content to a temporary file.
2. I enhanced the filter (filter-bin/_pdf2html.pl) to copy $file
to another temporary file (using copy() from File::Copy)
The temp file from step 1 is identical to the source file (size and
sum). xpdf runs against this file by hand without problems.
The temp file from step 2 is different: same size, different sum.
xpdf blows up when run against this file. I compared the files via "od
-c" and it seems that most (all?) of the \0 in the original PDF have
been converted to \n.
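
For anyone who wants to repeat the comparison without reading "od -c" output by eye, here is a minimal sketch in Python. It is not part of swish-e, and the names byte_substitutions and compare_files are made up for illustration. It assumes the two files are the same size (as mine were) and tallies which byte values were replaced by which.

```python
from collections import Counter


def byte_substitutions(original: bytes, filtered: bytes) -> Counter:
    """Count (original_byte, filtered_byte) pairs at positions where the inputs differ."""
    if len(original) != len(filtered):
        # A positional comparison only makes sense for same-size inputs.
        raise ValueError("inputs differ in length")
    return Counter((a, b) for a, b in zip(original, filtered) if a != b)


def compare_files(original_path: str, filtered_path: str) -> Counter:
    """Read both files in binary mode and tally the substituted bytes."""
    with open(original_path, "rb") as f:
        original = f.read()
    with open(filtered_path, "rb") as f:
        filtered = f.read()
    return byte_substitutions(original, filtered)


# In-memory demonstration: simulate the damage by turning every NUL into a newline.
sample = b"%PDF-1.3\n\x00\x00obj\x00"
damaged = sample.replace(b"\x00", b"\n")
for (before, after), count in byte_substitutions(sample, damaged).items():
    print(f"{before:#04x} -> {after:#04x}: {count} time(s)")
```

If the only pair reported is 0x00 -> 0x0a, that would match what I saw: the content is being treated as text somewhere rather than as binary data.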
Question: is anyone successfully indexing PDF documents on Linux with
swish-e-2.2.1? If so, can you please post your swish-e configuration
showing how you are filtering PDF to HTML (or text)?
I'm going to have to find a fix for this tonight (deadlines and all
that). Can someone point me to where in the code I should be looking
for fixing this? Roughly, I know it will be between the spidering and
the filtering, but I'm not sure where that code is (I've only briefly
glanced over the code before).
Thanks again,
greg_fenton.
=====
Greg Fenton
greg_fenton@yahoo.com
Received on Thu Oct 10 22:52:44 2002