Re: pdf2html.pm and File::Temp.pm

From: Gerald Klaas <gklaas(at)not-real.arb.ca.gov>
Date: Wed Jan 30 2002 - 21:25:29 GMT

Bill,
I have downloaded the patch, but haven't installed it yet so I could do the
pdfinfo test you suggested.

Running my spider, again I see
<<<
>> +Fetched 3 Cnt: 123 http://inside.arb.ca.gov/ds/regact/18dg01.pdf 200 OK application/pdf 14810647 parent:http://inside.arb.ca.gov/ds/regact/comlogdistributedgeneration.htm
Error (0): PDF file is damaged - attempting to reconstruct xref table...
Error: Top-level pages object is wrong type (null)
Error: Couldn't read page catalog
-Skipped http://inside.arb.ca.gov/ds/regact/18dg01.pdf due to 'filter_content' user supplied function #1 death '../src/spider.pl: Failed close on pipe to pdfinfo for /tmp/tjBwLfoprX: 256 at /app/swish/prog-bin/pdf2html.pm line 138.
>>>

I cancelled (Ctrl-C) the spider and looked for the /tmp/tjBwLfoprX file it was using.

<<<
[swish@o7 /tmp]$ ls -la tjB*
-rw-------    1 swish    swish     4999156 Jan 30 12:52 tjBwLfoprX
[swish@o7 /tmp]$ pdfinfo tjBwLfoprX
Error (0): PDF file is damaged - attempting to reconstruct xref table...
Error: Top-level pages object is wrong type (null)
Error: Couldn't read page catalog
>>>
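
For reference, the 256 in the die message looks like a raw wait status rather
than an exit code. Perl sets $? when a pipe is closed, and the child's exit
code sits in the high byte ($? >> 8), so 256 would mean pdfinfo exited with
status 1. A small shell sketch of that arithmetic (decode_status is a made-up
helper name, not part of swish-e or xpdf):

```shell
#!/bin/sh
# Hypothetical helper: split a raw wait status (as Perl reports it in $?
# after closing a pipe) into the exit code and the terminating signal.
decode_status() {
    status=$1
    echo "exit=$((status >> 8)) signal=$((status & 127))"
}

decode_status 256   # prints: exit=1 signal=0
```

From the shell itself, `pdfinfo tjBwLfoprX; echo $?` prints the exit code
directly, with no shifting needed.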

In my browser, I went to the URL http://inside.arb.ca.gov/ds/regact/18dg01.pdf
and the PDF does open in Acrobat, but I notice that it downloads at 14463 KB,
which matches the 14810647 bytes shown in the spider's Fetched line.  The temp
file is only 4999156 bytes, just under 5 MB, so I wonder if something is
cutting off the download.  Because the response code is 200, the truncated
file is still handed to pdf2html.pm, which promptly barfs when it runs out of
file before it has finished loading data.
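
One way to check the truncation guess is to compare the size of the temp file
on disk with the size from the Fetched line. This is only a sketch: check_size
is a hypothetical helper, and the numbers in the usage comment are the ones
from the output above.

```shell
#!/bin/sh
# Hypothetical helper: compare a file's size on disk with the size the
# server (or the spider's Fetched line) reported for it.
check_size() {
    file=$1
    expected=$2
    # wc -c may pad its output with spaces on some systems, so strip them.
    actual=$(wc -c < "$file" | tr -d ' ')
    if [ "$actual" -lt "$expected" ]; then
        echo "truncated: $actual of $expected bytes"
        return 1
    fi
    echo "complete: $actual bytes"
}

# Usage with the values from this thread, while the temp file still exists:
#   check_size /tmp/tjBwLfoprX 14810647
```

If the helper reports a truncated file, the PDF itself is probably fine and
the problem is upstream of pdfinfo, in whatever fetches or saves the document.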


Bill Moseley wrote:

> At 10:34 AM 01/30/02 -0800, Gerald Klaas wrote:
> >user supplied function #1 death '../src/spider.pl: Failed close on pipe to pdfinfo for /tmp/Gpcivvv24w: 256 at /app/swish/prog-bin/pdf2html.pm line 138.
> >'
> >---end snip---
> >
> >line 138 in pdf2html.pm is
> >close $sym or die "$0: Failed close on pipe to pdfinfo for $file: $?";
>
> Try running the pdfinfo on the file from the shell and look at its exit status and any possible error messages from pdfinfo.  I'd be interested in what it says.
>
> I just checked in a patch.  Let me know if you can't grab it from CVS or from sourceforge.
>
> http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/swishe/swish-e/prog-bin/?sortby=date#dirlist
>
> --
> Bill Moseley
> mailto:moseley@hank.org
Received on Wed Jan 30 21:26:36 2002