I'm continuing to have a problem with filters. I'm in a Windows 2000/XP
environment and am using the spider to crawl my site, which contains PDF
files. pdfinfo and pdftotext are installed and working from the command
line.
For each PDF file indexed I receive the following error:
Returned 0
- Using DEFAULT (HTML2) parser - Error: May not be a PDF file (continuing anyway)
Error (0): PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
(no words indexed)
I modified swishspider at line 144 to print the contents to stderr, and I
receive the following output for the document's meta tags. As you can see
below, the meta tags built from the pdfinfo output do not appear to be
formed properly. I just can't figure out why.
- Using DEFAULT (HTML2) parser - (23 words)
retrieving http://localhost/affidavit.pdf (1)...
spider 2376 [C:/Inetpub/Indexes/Temp/swishspider@3084
http://localhost/affidavit.pdf]
<html>
<head>
">eta name="author" content="jamin
">eta name="creationdate" content="04/23/03 10:40:15
">eta name="creator" content="Affidavit final.doc - Microsoft Word
">eta name="encrypted" content="no
">eta name="file_size" content="31838 bytes
">eta name="moddate" content="04/23/03 10:47:36
">eta name="optimized" content="yes
">eta name="page_size" content="612 x 792 pts (letter)
">eta name="pages" content="1
">eta name="pdf_version" content="1.4
">eta name="producer" content="Acrobat PDFWriter 5.0 for Windows NT
">eta name="tagged" content="no
">eta name="title" content="Affidavit final.doc
</head>
<body>
<pre>
The contents of the document appear to be OK from what I can see.
Have I missed something obvious, or do you need me to post the configuration
files as well?
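One guess, offered only as an assumption and not a diagnosis: the leading
`">` on each meta line above is what a terminal shows when the content value
ends in a carriage return, because the closing `">` is then printed over the
start of the line. That would fit pdfinfo output on Windows ending in CRLF
and being split on "\n" alone. The sketch below is Python, not the actual
swishspider Perl code; build_meta and render_on_terminal are hypothetical
names used only to reproduce the display and show that stripping the
trailing "\r" gives a well-formed tag.

```python
def build_meta(name, value):
    # Hypothetical stand-in for whatever builds the meta tags in the filter.
    return '<meta name="%s" content="%s">' % (name, value)


def render_on_terminal(line):
    # Simulate a terminal: a carriage return moves the cursor back to
    # column 0, and the characters that follow overwrite what is there.
    screen = []
    col = 0
    for ch in line:
        if ch == "\r":
            col = 0
            continue
        if col < len(screen):
            screen[col] = ch
        else:
            screen.append(ch)
        col += 1
    return "".join(screen)


# Assumed raw value: a pdfinfo field that kept its "\r" after a split on "\n".
raw_value = "jamin\r"

shown = render_on_terminal(build_meta("author", raw_value))
cleaned = build_meta("author", raw_value.rstrip("\r\n"))

print(shown)
print(cleaned)
```

If this is what is happening, the tags are not really missing their opening
`<m`; the stray carriage return sits inside the content attribute and only
the display is misleading.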
Rick
Richard Klingensmith
MSU Human Resources Information Systems
1407 S. Harrison Road Ste. 40
East Lansing, MI 48823
(517) 432-4636 ext. 155
klingensmith@hr.msu.edu
Received on Mon Jul 28 21:20:20 2003