Re: More Trouble with Filters

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Jul 28 2003 - 21:57:43 GMT

On Mon, Jul 28, 2003 at 02:19:54PM -0700, Klingensmith, Rick wrote:
> I'm continuing to have a problem with filters. I'm in a windows 2000/XP
> environment and am using the spider to crawl my site which contains pdf
> files. Pdfinfo and pdftotext are installed and working from the command
> line. 

That's a good thing to know.

> For each pdf file indexed I receive the following error:

> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> 
> Error: Couldn't find trailer dictionary
> 
> Error: Couldn't read xref table

Those are all messages coming from xpdf.  So the next step is to modify 
whatever is calling pdfinfo/pdftotext and see how it's being called.
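
To illustrate the idea only (this is not the actual swishspider or
SWISH::Filter code, which is Perl), here is a rough Python sketch of a
wrapper that logs the exact command line and the raw bytes that come
back. The name run_logged is made up for this example, and the
commented-out pdfinfo call assumes pdfinfo is on the PATH.

```python
import subprocess
import sys

def run_logged(argv):
    """Run a command, logging the exact argv and its raw output to stderr."""
    print("calling:", repr(argv), file=sys.stderr)
    result = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    print("exit status:", result.returncode, file=sys.stderr)
    # repr() makes stray control characters such as \r visible in the log
    print("stdout:", repr(result.stdout[:200]), file=sys.stderr)
    print("stderr:", repr(result.stderr[:200]), file=sys.stderr)
    return result.stdout

# Hypothetical usage, assuming pdfinfo is on the PATH:
# raw = run_logged(["pdfinfo", "test.pdf"])
```

Logging the argument list rather than a joined string shows exactly
which file name and options the helper program receives.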

> I modified swishspider at line 144 to print the contents to stderr and
> receive the following output for the meta tags for the document. As you can
> see below I believe the meta tags from the output from pdfinfo are not being
> formed properly. I just can't figure out why.

> <html>
> 
> <head>
> 
> ">eta name="author" content="jamin
>
> ">eta name="creationdate" content="04/23/03 10:40:15
>
> ">eta name="creator" content="Affidavit final.doc - Microsoft Word
>
> ">eta name="encrypted" content="no
>
> ">eta name="file_size" content="31838 bytes
>
> ">eta name="moddate" content="04/23/03 10:47:36

That's weird output.  Looks like it's dropping some characters and 
there's an extra blank line.  Maybe DOS line endings are causing a 
problem?
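
Here is a small Python sketch of why DOS line endings would be
consistent with that output. It is only a guess at the cause, not a
trace of what swishspider does. The helper names render and meta_tag
and the sample pdfinfo-style lines are invented for the illustration.
If the "Key: value" lines are split on "\n" alone, each value keeps a
trailing "\r"; when the resulting tag is printed to a terminal, the
carriage return moves the cursor back to column 0 and the closing
quote and bracket overwrite the first two characters of the tag.

```python
def render(text):
    """Roughly mimic a terminal: a carriage return moves the cursor to
    column 0, and later characters overwrite what is already there."""
    line = []
    col = 0
    for ch in text:
        if ch == "\r":
            col = 0
        elif col < len(line):
            line[col] = ch
            col += 1
        else:
            line.append(ch)
            col += 1
    return "".join(line)

def meta_tag(raw_line, strip_cr=False):
    """Turn one 'Key: value' line (pdfinfo-style) into a meta tag."""
    if strip_cr:
        raw_line = raw_line.rstrip("\r")
    key, value = raw_line.split(":", 1)
    return '<meta name="%s" content="%s">' % (key.strip().lower(), value.lstrip())

# Splitting CRLF output on "\n" only leaves a "\r" on the end of each line.
raw = "Author:         jamin\r\nEncrypted:      no\r\n"
for raw_line in raw.split("\n"):
    if raw_line:
        print(render(meta_tag(raw_line)))                 # damaged form
        print(render(meta_tag(raw_line, strip_cr=True)))  # after stripping \r
```

For each input line the first print shows the tag as a terminal would
display it with the stray "\r" in place, and the second shows it once
the "\r" has been stripped before the tag is built.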

Hmm, OK, so you are using -S http with swishspider.  Are you using the 
SWISH::Filter module(s) to decode the pdf?  Or are you using a 
FileFilter directive (although I'm not sure that works)?

If you are using the SWISH::Filter setup, here is what I did to test 
it: I added a "use lib" line to the swishspider file so it could find 
the modules, and ran:

moseley@bumby:~/apache$ ./swishspider swish http://localhost/apache/test.pdf

moseley@bumby:~/apache$ head swish.contents
<html>    
<head>
<meta name="author" content=" ">
<meta name="creationdate" content="Fri Mar 21 21:42:23 2003">
<meta name="creator" content="Microsoft Word: AdobePS 8.7.3 (301)">
<meta name="encrypted" content="no">
<meta name="file_size" content="32194 bytes">
<meta name="moddate" content="Fri Mar 21 21:42:23 2003">
<meta name="optimized" content="yes">
<meta name="page_size" content="612 x 792 pts (letter)">


Can you duplicate that under Windows?



-- 
Bill Moseley
moseley@hank.org
Received on Mon Jul 28 21:57:52 2003