Re: problem indexing PDFs - "Error (0): PDF file is damaged"

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Dec 17 2003 - 01:16:22 GMT
On Tue, Dec 16, 2003 at 02:20:11PM -0800, Brad_Horstkotte@capgroup.com wrote:

> I've been poking around trying to figure out how to get PDF indexing to
> work, and haven't had any luck - I'm running into the same problem which
> was discussed on this thread (null characters in the PDF files being
> replaced with line feed characters, and later on the PDF is seen as
> invalid):
> 
> http://swish-e.org/archive/4511.html
> 
> Has this problem been fixed?

I think so.  But that's not to say it isn't happening somewhere else in
the chain.  When using a filter with -S prog, swish-e doesn't replace
nulls with \n.  But swish-e does read the entire file into memory and
then write it out to a temp file before calling the filter.  So
something could be happening there -- or maybe it's how the spider is
fetching it.

> The PDFs convert fine when running _pdf2html.pl from the command line on
> the file, but fail when converted via the spider.

Well, what I'd do is edit _pdf2html.pl and do something like:

   system("copy $file c:\\test.pdf");

assuming that works on Windows.  That will allow you to see if indeed
the copy is the same as the original (you can check by file size -- I'm
not sure what Windows provides for comparing files).
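
As a rough sketch of that comparison, something like the following could
report where two files first differ, which says more than the file size
alone.  It is written in Python purely as an assumption (it is not part
of swish-e), and the name first_difference is made up for illustration.
Both files are opened in binary mode so no newline translation happens
on Windows.

```python
def first_difference(path_a, path_b, chunk_size=65536):
    """Return the offset of the first differing byte, or None if identical.

    If one file is a prefix of the other, the returned offset is the
    length of the shorter file.
    """
    offset = 0
    # Binary mode: bytes are read exactly as stored on disk.
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                # Locate the exact position inside this chunk.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # No differing byte in the overlap: one chunk is shorter.
                return offset + min(len(a), len(b))
            if not a:
                # Both reads returned nothing: end of both files.
                return None
            offset += len(a)
```

Calling first_difference on the original PDF and the copy made by the
filter returns None when they match.  If a null byte had been replaced by
a line feed, the returned offset would point at that byte.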

The bit of debugging I'd do is run the spider to just fetch the PDF file
and save its output to a file.  Look at the first few lines of the file
and see if the Content-Length is what you expect.  Actually, that might
not work: Windows uses \r\n on disk, but inside Perl and C the data only
contains \n, so the Content-Length might be different.
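
To illustrate the size mismatch, here is a small sketch (Python, chosen
as an assumption; translate_newlines is a hypothetical helper) that
mimics what text-mode reading does on Windows and shows the byte count
changing.

```python
def translate_newlines(data: bytes) -> bytes:
    """Mimic text-mode reading on Windows: each CRLF pair becomes a single LF."""
    return data.replace(b"\r\n", b"\n")

# Four lines as they would be stored on disk with Windows line endings.
on_disk = b"<html>\r\n<head>\r\n</head>\r\n</html>\r\n"
in_memory = translate_newlines(on_disk)

# One byte is lost per line, so the two lengths disagree.
print(len(on_disk), len(in_memory))  # prints: 34 30
```

The same translation applied to a binary file such as a PDF would change
any \r\n byte pair that happens to occur in the data.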

I have also written the output from the spider to a file, used an editor
to remove the header lines, and then compared the files.  It's a pain.
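
The header stripping could be scripted instead of done in an editor.
This is a minimal sketch in Python (an assumption, not part of swish-e;
the function names are hypothetical).  It assumes the saved file holds a
single document in the layout the spider prints: "Name: value" header
lines, a blank line, then the content.

```python
def split_spider_output(raw: bytes):
    """Split one captured document into (headers dict, body bytes)."""
    # The headers end at the first blank line; accept LF or CRLF endings
    # and use whichever separator occurs first in the data.
    candidates = []
    for sep in (b"\r\n\r\n", b"\n\n"):
        pos = raw.find(sep)
        if pos != -1:
            candidates.append((pos, len(sep)))
    if not candidates:
        raise ValueError("no blank line found after the headers")
    pos, sep_len = min(candidates)

    headers = {}
    for line in raw[:pos].splitlines():
        # Split on the first colon only, so URLs in the value stay intact.
        name, _, value = line.partition(b":")
        headers[name.strip().decode("ascii")] = value.strip().decode("ascii")
    return headers, raw[pos + sep_len:]


def declared_and_actual_length(raw: bytes):
    """Return (Content-Length from the header, actual body length in bytes)."""
    headers, body = split_spider_output(raw)
    return int(headers["Content-Length"]), len(body)
```

Read the saved file in binary mode (open(path, "rb").read()) before
passing it in.  The returned body can then be written out and compared
against the original file.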

> I saw SWISH::Filter mentioned as an alternative, but so far have avoided it
> since I'm a perl dolt, and it looked like less of a turnkey alternative.

No, it's more turnkey.  If you use the "default" mode it should know how
to decode it:

$ /usr/local/lib/swish-e/spider.pl default http://localhost/apache/test.pdf | head
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'
Path-Name: http://localhost/apache/test.pdf
Content-Length: 12593
Last-Mtime: 1064946675
Document-Type: HTML*

<html>    
<head>
<meta name="author" content=" ">
<meta name="creationdate" content="Fri Mar 21 21:42:23 2003">
<meta name="creator" content="Microsoft Word: AdobePS 8.7.3 (301)">


Here it is on Windows:

E:\SWISH-E>perl lib/swish-e/spider.pl default "http://bumby/apache/test.pdf" | head
lib/swish-e/spider.pl: Reading parameters from 'default'
Can't use keep-alive: conn_cache method not available

Summary for: http://bumby/apache/test.pdf
Total Bytes: 12,579  (12579.0/sec)
 Total Docs:      1  (1.0/sec)
Unique URLs:      1  (1.0/sec)
Path-Name: http://bumby/apache/test.pdf
Content-Length: 12579
Last-Mtime: 1064946675
Document-Type: HTML*

<html>
<head>
<meta name="author" content=" ">
<meta name="creationdate" content="03/21/03 21:42:23">
<meta name="creator" content="Microsoft Word: AdobePS 8.7.3 (301)">
 

And even piping to swish:

E:\SWISH-E>perl lib/swish-e/spider.pl default "http://bumby/apache/test.pdf" | swish-e -S prog -i stdin
lib/swish-e/spider.pl: Reading parameters from 'default'
Can't use keep-alive: conn_cache method not available

Summary for: http://bumby/apache/test.pdf
Total Bytes: 12,579  (12579.0/sec)
 Total Docs:      1  (1.0/sec)
Unique URLs:      1  (1.0/sec)
Indexing Data Source: "External-Program"
Indexing "stdin"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 813 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
813 unique words indexed.
4 properties sorted.
1 file indexed.  12579 total bytes.  2299 total words.
Elapsed time: 00:00:01 CPU time: 00:00:00
Indexing done!



-- 
Bill Moseley
moseley@hank.org
Received on Wed Dec 17 01:16:31 2003