Indexing pdf files

From: Kemp Randy-W18971 <Randy.L.Kemp(at)not-real.motorola.com>
Date: Tue Jul 31 2001 - 16:58:38 GMT
I can't for the life of me get my PDF files to index.  My executables are in
/usr/local/ and they work individually, including pdftotext.  Could
someone please help me with the filters on the Solaris SPARC 5.6 platform
with swish-e 2.0.5?

-----------------------> executables for pdftotext and swish-e <-----------------------
ee110:/usr/local> ls
bin           conf          lib           perl          swish-e
cmold.d       ecg           pdftotext     psionic       swish-search

I am running version 2.0.5 of swish-e.

-----------------------------> pdf file <-----------------------------
My pdf file is located in ee110:/usr2/apache/htdocs/pdffiles> ls
requirements.pdf

-----------------------------> config file at
ee110:/usr2/ecadtesting/swishe-index> ls
index.swish      search.log       swisheconf.conf

My config file is

# Sample SWISH configuration file

# Global Networks Technical Support, support@gobalnetworks.com, 5/10/96



#IndexDir /usr/home/globalne/usr/local/etc/httpd/htdocs/
IndexDir /usr2/apache/htdocs/pdffiles/



# This is a space-separated list of files and

# directories you want indexed. You can specify

# more than one of these directives.

# Be sure to change globalne to be your Server login name.



IndexFile /usr2/ecadtesting/swishe-index/index.swish

# This is what the generated index file will be.



IndexName "PCS Web Page Index"

IndexDescription "This is a full index of the PCS web site."

IndexPointer "http://ee110.ecg.csg.mot.com:8000/cgi-bin/search.cgi"

IndexAdmin "PCS Technical Support (Randy.L.Kemp@motorola.com)"

# Extra information you can include in the index file.

# You probably want to change the Global Networks references.



IndexOnly .html .htm .txt .gif .xbm .au .mov .mpg

# Only files with these suffixes will be indexed.



IndexReport 3

# This is how detailed you want reporting. You can specify numbers

# 0 to 3 - 0 is totally silent, 3 is the most verbose.



FollowSymLinks yes

# Put "yes" to follow symbolic links in indexing, else "no".



NoContents .gif .xbm .au .mov .mpg

# Files with these suffixes will not have their contents indexed -

# only their file names will be indexed.



#ReplaceRules replace "/usr/home/globalne/usr/local/etc/httpd/htdocs" "http://www.globalnetworks.com"
ReplaceRules replace "/usr2/apache/htdocs" "http://ee110.ecg.csg.mot.com:8000"


# ReplaceRules allow you to make changes to file pathnames

# before they're indexed.

# Be sure to change globalne to be your Server login name.



FileRules pathname contains admin testing demo trash construction confidential

FileRules filename is index.html

FileRules filename contains # % ~ .bak .orig .old old.

FileRules title contains construction example pointers

FileRules directory contains .htaccess

# Files matching the above criteria will *not* be indexed.



IgnoreLimit 50 100

# This automatically omits words that appear too often in the files

# (these words are called stopwords). Specify a whole percentage

# and a number, such as "80 256". This omits words that occur in

# over 80% of the files and appear in over 256 files. Comment out

# to turn off auto-stopwording.



IgnoreWords SwishDefault

# The IgnoreWords option allows you to specify words to ignore.

# Comment out for no stopwords; the word "SwishDefault" will

# include a list of default stopwords. Words should be separated by spaces

# and may span multiple directives.

FilterDir /usr2/ecadtesting/shellscripts/
FileFilter .pdf pdf-filter.sh
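
The contents of pdf-filter.sh are not shown in this post. For reference, a minimal sketch of what a filter of this kind might look like follows. It assumes that swish-e passes the path of the file to be filtered as the first argument and indexes whatever the filter writes to standard output; that calling convention is an assumption here, as are the PDFTOTEXT variable and the pdf_filter function name. The /usr/local/pdftotext path is taken from the directory listing above.

```shell
#!/bin/sh
# Hypothetical pdf-filter.sh: a sketch, not the script used in this post.
# Assumption: swish-e calls the filter with the file path as $1 and
# indexes the text the filter prints on standard output.

# Converter location; override by setting PDFTOTEXT in the environment.
PDFTOTEXT="${PDFTOTEXT:-/usr/local/pdftotext}"

pdf_filter() {
    # Refuse to run without a readable input file.
    if [ ! -r "$1" ]; then
        echo "pdf-filter: cannot read '$1'" >&2
        return 1
    fi
    # "-" as the output name makes pdftotext write the text to stdout.
    "$PDFTOTEXT" "$1" -
}

# Run only when called with arguments, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    pdf_filter "$@"
fi
```

Running the script by hand on requirements.pdf should print the extracted text to the terminal if the filter itself is working.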

------------------> Test results with pdf (html docs in directory htdocs index OK) <------------------
My test results are:

ee110:/usr2/ecadtesting/shellscripts> ls
dailystats.sh    ncftpput.sh      rkgraph001.sh    webalizer.sh
http-analyze.sh  pdf-filter.sh    swishe.sh
Checking dir "/usr2/apache/htdocs/pdffiles/"...

Removing very common words...
336 words removed.
0 words removed not in common words array:

Writing main index...
Computing hash table ...
Writing header ...
Writing index entries ...
Writing stopwords ...
no unique words indexed.
Writing file index...
Writing file list ...
Writing file offsets ...
Writing MetaNames ...
Writing offsets (2)...
no files indexed.
Running time: Less than a second.
Indexing done!
ee110:/usr2/ecadtesting/shellscripts> 

ee110:/usr2/ecadtesting/shellscripts> more swishe.sh
/usr/local/swish-e -c /usr2/ecadtesting/swishe-index/swisheconf.conf
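
Since the run above reports "no files indexed", one check that may be worth making is whether a given file name passes the IndexOnly suffix list in the config. The sketch below does that with plain suffix matching, which is an assumption about how swish-e applies IndexOnly; the function name index_only_allows is made up for illustration.

```shell
#!/bin/sh
# Sketch: report whether a file name ends in one of the suffixes on the
# IndexOnly line of a swish-e config file. Plain suffix matching is an
# assumption about how swish-e applies IndexOnly.

index_only_allows() {
    conf=$1
    file=$2
    # Take the suffix list from the first uncommented IndexOnly line.
    suffixes=$(sed -n 's/^IndexOnly[[:space:]][[:space:]]*//p' "$conf" | head -n 1)
    for suffix in $suffixes; do
        case "$file" in
            *"$suffix") return 0 ;;
        esac
    done
    # No listed suffix matched (or no IndexOnly line was found).
    return 1
}

# Usage: script.sh CONFIG FILENAME
if [ "$#" -eq 2 ]; then
    if index_only_allows "$1" "$2"; then
        echo "$2: suffix is listed in IndexOnly"
    else
        echo "$2: suffix is not listed in IndexOnly"
    fi
fi
```

It would be called with the config file and a file name, for example swisheconf.conf and requirements.pdf.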
Received on Tue Jul 31 16:59:10 2001