Bill and All,
I'm probably beginning to sound like a flake, but I've got myself very
confused at this point. I've used the following config file and added a bare
use lib line to the swishspider file:
# ----- SiteIndex.config - Spider using "http" method -------
# Use the file filter to index pdf files
#FileFilter .pdf c:/SWISH-E/filter-bin/_pdf2html.pl '"%p" -'
#FileFilter .pdf c:/SWISH-E/filter-bin/pdftotext.exe '"%p" -'
# Filter Directory
# end of SiteIndex Config file
swishspider is in my SWISH-E directory. With this configuration the pdf
files are indexed correctly, but I'm still getting the same meta tag
output as in my previous post below.
This is the output before the pdf contents:
retrieving http://localhost/affidavit.pdf (1)...
spider 2644 [C:/Inetpub/Indexes/Temp/swishspider@2604
This is the output after the pdf contents, which are the same as in my
previous post below:
- Using DEFAULT (HTML2) parser - (279 words)
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 135 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
135 unique words indexed.
4 properties sorted.
2 files indexed. 4412 total bytes. 302 total words.
Elapsed time: 00:00:03 CPU time: 00:00:03
I thought I was using SWISH::Filter by default, but now I'm not sure.
When I use the FileFilter directive in the config file I get errors saying that
the pdf is invalid. Once I commented both lines out, it at least indexed the pdf
without error. The FilterDir directive doesn't seem to matter: I get the same
output with or without it. I did confirm that the document is being indexed
with a search for words that only appear in the pdf with the correct
My perl/site/lib/swish subdirectory contains filter.pm and
perl/site/lib/swish/filters contains the other filter modules. I'm convinced
this is a simple configuration issue, but my perl knowledge is limited so
debugging has been a problem.
Thanks for the help.
From: Bill Moseley [mailto:firstname.lastname@example.org]
Sent: Monday, July 28, 2003 5:58 PM
To: Multiple recipients of list
Subject: [SWISH-E] Re: More Trouble with Filters
On Mon, Jul 28, 2003 at 02:19:54PM -0700, Klingensmith, Rick wrote:
> I'm continuing to have a problem with filters. I'm in a windows 2000/XP
> environment and am using the spider to crawl my site which contains pdf
> files. Pdfinfo and pdftotext are installed and working from the command
That's a good thing to know.
> For each pdf file indexed I receive the following error:
> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> Error: Couldn't find trailer dictionary
> Error: Couldn't read xref table
Those are all messages coming from xpdf. So the next step is to modify
whatever is calling pdfinfo/pdftotext and see how it's being called.
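One way to see how a tool is being called is to wrap the call and log both the argument list and the raw bytes that come back. This is only a minimal sketch in Python, not part of swishspider or SWISH::Filter; the function name run_and_show is made up for illustration, and it assumes the program is passed as a plain argument list with no shell in between.

```python
import subprocess
import sys


def run_and_show(argv):
    """Run argv (a list, no shell) and log exactly what was called and returned.

    Returns (exit_code, stdout_bytes, stderr_bytes).
    """
    # Log the exact argument list so quoting problems are visible.
    print("calling:", argv, file=sys.stderr)
    proc = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # repr() makes hidden bytes such as b"\r" show up in the log.
    for raw_line in proc.stdout.splitlines(keepends=True):
        print(repr(raw_line), file=sys.stderr)
    return proc.returncode, proc.stdout, proc.stderr


# Hypothetical usage (assumes pdfinfo is on the PATH and the file exists):
# code, out, err = run_and_show(["pdfinfo", "affidavit.pdf"])
```

If the logged lines end in b"\r\n" rather than b"\n", the tool is emitting DOS line endings and whatever parses its output has to allow for that.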
> I modified swishspider at line 144 to print the contents to stderr and
> receive the following output for the meta tags for the document. As you
> see below I believe the meta tags from the output from pdfinfo are not
> formed properly. I just can't figure out why.
> ">eta name="author" content="jamin
> ">eta name="creationdate" content="04/23/03 10:40:15
> ">eta name="creator" content="Affidavit final.doc - Microsoft Word
> ">eta name="encrypted" content="no
> ">eta name="file_size" content="31838 bytes
> ">eta name="moddate" content="04/23/03 10:47:36
That's weird output. Looks like it's dropping some characters and
there's an extra blank line. Maybe DOS line endings are causing a problem.
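To illustrate the line-ending guess, here is a minimal sketch in Python. It is not the actual SWISH::Filter code; it only assumes that pdfinfo on Windows ends each line with CR LF and that only the LF gets removed before the meta tag is built. The names pdfinfo_to_meta and render_like_terminal are made up for the example.

```python
def pdfinfo_to_meta(raw, strip_cr):
    """Turn pdfinfo-style 'Key: value' lines into meta tags.

    With strip_cr=False only the "\n" is removed, so a trailing "\r"
    from a DOS line ending stays inside the content attribute.
    """
    tags = []
    for line in raw.split("\n"):
        if strip_cr:
            line = line.rstrip("\r")
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        name = key.strip(" ").lower().replace(" ", "_")
        # strip(" ") removes padding spaces only, so a stray "\r" survives.
        tags.append('<meta name="%s" content="%s">' % (name, value.strip(" ")))
    return tags


def render_like_terminal(text):
    """Mimic a console: "\r" moves back to column 0 and later text overwrites."""
    shown = []
    for segment in text.split("\r"):
        shown[:len(segment)] = segment
    return "".join(shown)


raw = "Author:         jamin\r\n"
broken = pdfinfo_to_meta(raw, strip_cr=False)[0]
print(repr(broken))
print(render_like_terminal(broken))
print(pdfinfo_to_meta(raw, strip_cr=True)[0])
```

Under that assumption the closing "> lands on top of the leading <m when the tag is printed, which would give the same shape as the output quoted above. Stripping the CR as well as the LF gives a normal tag.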
Hum, ok so you are using -S http with swishspider. Are you using the
SWISH::Filter module(s) to decode the pdf? Or are you using a
FileFilter directive (although I'm not sure that works)?
If using the SWISH::Filter setup then I just added a use lib line to the
swishspider file to find the modules and ran:
moseley@bumby:~/apache$ ./swishspider swish http://localhost/apache/test.pdf
moseley@bumby:~/apache$ head swish.contents
<meta name="author" content=" ">
<meta name="creationdate" content="Fri Mar 21 21:42:23 2003">
<meta name="creator" content="Microsoft Word: AdobePS 8.7.3 (301)">
<meta name="encrypted" content="no">
<meta name="file_size" content="32194 bytes">
<meta name="moddate" content="Fri Mar 21 21:42:23 2003">
<meta name="optimized" content="yes">
<meta name="page_size" content="612 x 792 pts (letter)">
Can you duplicate that under Windows?
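When comparing the two platforms it may help to check the swish.contents file for carriage returns, since they do not show up in a plain listing. A minimal sketch in Python, reading the file as bytes; the function name count_line_endings is made up for the example.

```python
def count_line_endings(path):
    """Count CR LF, bare LF and bare CR line endings in a file, read as bytes."""
    with open(path, "rb") as handle:
        data = handle.read()
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "lf_only": data.count(b"\n") - crlf,
        "cr_only": data.count(b"\r") - crlf,
    }


# Hypothetical usage, run in the directory where swishspider wrote its output:
# print(count_line_endings("swish.contents"))
```

A non-zero "crlf" count would point towards DOS line endings ending up in the generated meta tags.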
Received on Tue Jul 29 13:20:05 2003