Indexing pdf files

From: Klingensmith, Rick
Date: Fri Jul 25 2003 - 13:35:47 GMT
I'm having a problem getting filters to index .pdf files in a Windows 2000 /
XP Pro environment. I'm using the spider (yes, I've read that the -S prog
method is more efficient, but one step at a time) and the xpdf pdftotext
program. The output I get from running swish-e is:


C:\SWISH-E>swish-e -S http -c conf/siteindex.config -v 3

Parsing config file 'conf/siteindex.config'

Parsing config file 'C:/Swish-E/conf/Settings.config'


Warning: Configuration setting for TmpDir 'C:/Inetpub/Indexes/Temp' will be

overridden by environment setting 'C:\DOCUME~1\klingen2\LOCALS~1\Temp'

Indexing Data Source: "HTTP-Crawler"

Indexing "http://localhost"

Returned 0

retrieving http://localhost (0)...

Returned 0

 - Using DEFAULT (HTML2) parser -  (23 words)

retrieving http://localhost/affidavit.pdf (1)...

Returned 0

 - Using DEFAULT (HTML2) parser - Error: Couldn't open file


c:\SWISH-E\filter-bin\ Failed close on pipe to pdfinfo for

ub\Indexes\Temp\swishspider@3840.contents-: 256 at
c:\SWISH-E\filter-bin\_pdf2ht line 54.

 (no words indexed)


Removing very common words...

no words removed.

Writing main index...

Sorting words ...

Sorting 19 words alphabetically

Writing header ...

Writing index entries ...

  Writing word text: Complete

  Writing word hash: Complete

  Writing word data: Complete

19 unique words indexed.

4 properties sorted.

2 files indexed.  4335 total bytes.  23 total words.

Elapsed time: 00:00:03 CPU time: 00:00:03

Indexing done!



My configuration file looks like this:


# Include our site-wide configuration settings:


#IncludeConfigFile D:/ProgramFiles/Swish-E/conf/Settings.config

IncludeConfigFile C:/Swish-E/conf/Settings.config


# Specify the URL (or URLs) to index:


IndexDir http://localhost


# If a server goes by more than one name you can use this directive:


# EquivalentServer



MaxDepth 10


# The number of seconds to wait between issuing

# requests to a server.  The default is 60 seconds.


Delay 1


TmpDir C:/Inetpub/Indexes/Temp


# The "http" method uses a Perl helper program called "swishspider" to

# fetch each document from the web. It is included in the src directory of

# the swish-e distribution.


SpiderDirectory C:/Swish-E


# Put the index files in the Inetpub/Indexes directory

#IndexFile D:/Inetpub/Indexes/SiteIndex.New.index

IndexFile C:/Inetpub/Indexes/SiteIndex.index


# Use the file filter to index pdf files

#FileFilter .pdf c:/SWISH-E/filter-bin/ "'%p' -"

FileFilter .pdf c:/SWISH-E/filter-bin/

FileFilter .pdf c:/SWISH-E/filter-bin/pdftotext.exe "'%p'"


# Filter Directory

#FilterDir C:/SWISH-E/filters


# end of SiteIndex Config file
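
For reference, a single-filter version of the .pdf setup could be sketched as below. It reuses the pdftotext.exe path from the active entry above and the "'%p' -" options from the commented-out one. The IndexContents line and the choice to keep only one FileFilter entry for .pdf are assumptions, not a tested fix. The trailing "-" asks pdftotext to write the extracted text to standard output, which is where swish-e reads a filter's result, and %p stands for the path of the work file the spider saved.

```
# Sketch only: one filter entry for .pdf, not verified on Windows 2000 / XP.
# The pdftotext.exe path is copied from the config above.
# "'%p' -": %p = path to the fetched work file, "-" = write text to stdout.
FileFilter .pdf c:/SWISH-E/filter-bin/pdftotext.exe "'%p' -"

# Assumption: pdftotext emits plain text, so index .pdf content with the
# text parser rather than the default HTML2 parser shown in the log.
IndexContents TXT .pdf
```

Note that cmd.exe does not treat single quotes as quoting characters, so the quoting around %p may need adjusting on Windows.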



My Settings.config file looks like this:


# These settings tell swish what defines a word.


WordCharacters abcdefghijklmnopqrstuvwxyz0123456789.-


IgnoreFirstChar .-

IgnoreLastChar  .-


# Finally, resulting words must begin/end with one

# of the characters listed here


BeginCharacters abcdefghijklmnopqrstuvwxyz0123456789

EndCharacters   abcdefghijklmnopqrstuvwxyz0123456789


# Turn this on for a slight performance improvement

#FollowSymLinks yes


IndexReport 2


#IgnoreWords file: D:/ProgramFiles/Swish-E/conf/stopwords/english.txt

IgnoreWords file: C:/Swish-E/conf/stopwords/english.txt


TranslateCharacters :ascii7:


BumpPositionCounterCharacters |.


As you can see, it's pretty standard, and all the HTML pages on my site are
indexed with no problem. I have ActiveState Perl installed on the system.
Any ideas where I've gone wrong?




Richard Klingensmith

MSU Human Resources Information Systems

1407 S. Harrison Road Ste. 40

East Lansing, MI 48823

(517) 432-4636 ext. 155


Received on Fri Jul 25 13:35:59 2003