Re: PDF to HTML causing swish-e to crash

From: Greg Fenton <greg_fenton(at)not-real.yahoo.com>
Date: Thu Oct 10 2002 - 22:48:55 GMT
--- Bill Moseley <moseley@hank.org> wrote:
> I sure wish pdftotext printed the file name with its errors.

Agreed...  ;-(

> So when you pass the file directly to pdftotext and pdfinfo it works
> fine?

Yes.

> Looks like you are passing an invalid file to pdftotext.  Have your
> filter write it to disk and compare with the source file.
> Try indexing a single file so you can be sure of what file is
> generating the errors.

Bill,

Thanks for the list of steps.  It appears to be a bug in swish-e, but I
am open to other interpretations.

Here's what I have:

1. I enhanced spider.pl to write $$content to a temporary file.
2. I enhanced the filter (filter-bin/_pdf2html.pl) to copy $file
   to another temporary file (using copy() from File::Copy)

The temp file from step 1 is identical to the source file (size and
sum).  xpdf runs against this file by hand without problems.
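
For anyone who wants to repeat the size-and-sum check, here is a minimal
sketch.  It is written in Python rather than the Perl used by spider.pl,
the function names are hypothetical, and SHA-256 stands in for whatever
"sum" tool you prefer.  Files are opened in binary mode so the check
itself cannot alter any bytes.

```python
import hashlib
import os


def fingerprint(path):
    """Return (size_in_bytes, sha256_hex) for the file at `path`."""
    digest = hashlib.sha256()
    # Binary mode matters: text mode could translate bytes on some platforms.
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()


def same_file_content(path_a, path_b):
    """True when both files have the same size and the same checksum."""
    return fingerprint(path_a) == fingerprint(path_b)


# Hypothetical usage; the paths are placeholders, not real files:
# same_file_content("source.pdf", "/tmp/spider_copy.pdf")
```

Equal size with a different checksum, as in the step 2 result, points to
bytes being substituted in place rather than added or dropped.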

The temp file from step 2 is different: same size, different sum.
xpdf blows up when run against this file.  I compared the two files
with "od -c" and it seems that most (all?) of the \0 bytes in the
original PDF have been converted to \n.
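
The "od -c" comparison can also be automated.  The sketch below is an
illustration only (Python, hypothetical function names, not part of
swish-e): it tallies which byte values were replaced by which, so a
\0 -> \n substitution shows up as a count for the pair (0, 10).  It
assumes both files are the same size, as they are here.

```python
from collections import Counter


def tally_substitutions(original: bytes, damaged: bytes) -> Counter:
    """Count (original_byte, damaged_byte) pairs at positions that differ."""
    if len(original) != len(damaged):
        raise ValueError("inputs differ in length; positional comparison is not meaningful")
    # Iterating over bytes yields integers, so NUL is 0 and newline is 10.
    return Counter((a, b) for a, b in zip(original, damaged) if a != b)


def compare_files(path_a, path_b) -> Counter:
    """Read both files in binary mode and tally the byte substitutions."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return tally_substitutions(fa.read(), fb.read())


# Hypothetical usage; the paths are placeholders, not real files:
# for (old, new), count in compare_files("source.pdf", "/tmp/filter_copy.pdf").items():
#     print(f"{old:#04x} -> {new:#04x}: {count} times")
```

If the only pair reported is (0, 10), every difference is a NUL turned
into a newline; checking whether any NUL bytes survive in the damaged
copy would settle the "most or all" question.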


Question: is anyone successfully indexing PDF documents on Linux with
swish-e-2.2.1?  If so, can you please post your swish-e configuration
showing how you are filtering PDF to HTML (or text)?

I'm going to have to find a fix for this tonight (deadlines and all
that).  Can someone point me to where in the code I should be looking
to fix this?  Roughly, I know it will be between the spidering and
the filtering, but I'm not sure where that code is (I've only briefly
glanced over the code before).

Thanks again,
greg_fenton.

=====
Greg Fenton
greg_fenton@yahoo.com

Received on Thu Oct 10 22:52:44 2002