Spidering PDF's with Swish

From: <AHatton(at)not-real.oxfam.org.uk>
Date: Wed Feb 06 2002 - 21:33:18 GMT
Apologies for the double post, but I am concerned this may not have posted
properly the first time around.

Re: issue with spidering PDFs with Swish.
This problem is quite urgent, so I would really appreciate any help anyone can
give; otherwise we will have to look at other solutions :-<

rgds
Andrew

Oxfam works with others to find lasting solutions to poverty and suffering.

------------------------------------------------------------------
Dear All,

I hope you can help with a problem we have indexing PDF files using the
spider (http) method of indexing.


Whilst using swish-e -S http.. etc works fine for indexing other content,
we can't get it to index PDF files.


We have installed the correct PDF-to-text filters, and have got swish to
index PDFs successfully using the file system method, i.e. swish-e -S
fs...etc, so the filters are working OK.


We are puzzled where to look next, and hope someone can point us in the
right direction.


If it helps, I have included the following information about our setup
below.
+ The output of running the command with the -S http flag
+ The output of running the command via the filesystem method
+ The shell script we are using to pass input to the filter programme
+ The conf file for the http method

We are using swish-e version 2.0


Many thanks
Andrew Hatton




+Output from http method+



midgard# /usr/local/bin/swish-e -S http -v3 -c /usr/local/etc/ahtestconf > /tmp/pdf.log
midgard# more /tmp/pdf.log
Indexing Data Source: "HTTP-Crawler"
Indexing http://midgard.poptel.org.uk/test..
retrieving http://midgard.poptel.org.uk/test (0)...
retrieving http://midgard.poptel.org.uk/test/ (0)...
(9 words)
retrieving http://midgard.poptel.org.uk/test/ThisisOxfam.pdf (1)...
(2 words)


Removing very common words...
no words removed.
Writing main index...
Computing hash table ...
Writing header ...
Writing index entries ...
Writing stopwords ...
10 unique words indexed.
Writing file index...
Writing file list ...
Writing file offsets ...
Writing MetaNames ...
Writing offsets (2)...
2 files indexed.
Running time: 5 seconds.




+ Output from temp file using filesystem method (showing this works) +


midgard# more /tmp/pdf.log
Indexing Data Source: "File-System"
Indexing /usr/local/htdocs..
Checking dir "/usr/local/htdocs"...
ThisisOxfam.pdf (1529 words)
fd.pdf (6426 words)
test.html (9 words)




+ Shell Script +


midgard# cd /usr/local/htdocs/swish-filter
midgard# more pdf-filter.sh


#!/bin/sh
#/usr/X11R6/bin/pdftotext -q $1 -
#/usr/X11R6/bin/pdftotext "$1" - 2>/dev/null
/usr/X11R6/bin/pdftotext "$1" -
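
One way to narrow this down might be a debugging variant of the filter script that logs the arguments it receives. The log would then show whether swish-e calls the filter at all under the http method, and with what file name. This is only a sketch: the log path /tmp/pdf-filter.log and the LOG and PDFTOTEXT variables are assumptions for illustration, and the pdftotext path is the one from the script above.

```shell
#!/bin/sh
# Debugging variant of pdf-filter.sh (a sketch, not part of swish-e).
# LOG and PDFTOTEXT are illustrative names; override them from the environment.
LOG="${LOG:-/tmp/pdf-filter.log}"
PDFTOTEXT="${PDFTOTEXT:-/usr/X11R6/bin/pdftotext}"

log_and_filter() {
    # Record every argument the indexer passes, one line per call.
    echo "$(date) args: $*" >> "$LOG"
    # Convert the file named by the first argument to text on stdout,
    # keeping any pdftotext error messages in the same log.
    "$PDFTOTEXT" "$1" - 2>> "$LOG"
}

# Only run the filter when the script is called with arguments.
if [ "$#" -gt 0 ]; then
    log_and_filter "$@"
fi
```

After a run with -S http, an empty or missing log would suggest the filter is never invoked for the spidered PDF, while a logged file name that does not end in .pdf would point at the extension match.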


+ The conf file itself +


midgard# more ahtestconf
# DIRECTIVES COMMON to HTTP and FILESYSTEM METHODS
###################################################
# WINDOWS USERS NOTE:
# Specify ALL files and directory paths in the
# the config file using the forward slash, as
# in /thisdirectory.
#
###################################################


IndexDir http://midgard.poptel.org.uk/test
# For the FileSystem Method:
# This is a space-separated list of files and
# directories you want indexed. You can specify
# more than one of these directives.
#
# For the HTTP Method:
# Use the URL's from which you want the spidering
# to begin.
# NOTE: use html files rather than directories
# for this method.


IndexFile /usr/local/htdocs/search/index0
# This is what the generated index file will be.


#IndexName "Improvement index"
#IndexDescription "This is an index to test bug fixes in swish."
#IndexPointer "http://sunsite/~ghill/swish/index.html"
#IndexAdmin "Giulia Hill, (ghill@library.berkeley.edu)"
# Extra information you can include in the index file.


MetaNames first author
# List of all the meta names used in the file to index, must be on one line.
# If no metanames DO NOT delete the line.
# New in 2.0 -> automatic option will extract metanames dynamically
# eg:
# MetaNames automatic


IndexReport 1
# This is how detailed you want reporting. You can specify numbers
# 0 to 3 - 0 is totally silent, 3 is the most verbose.


FollowSymLinks yes
# Put "yes" to follow symbolic links in indexing, else "no".


#UseStemming no
# Put yes to apply word stemming algorithm during indexing,
# else no. See the manual for info about stemming. Default is
# no.


#PropertyNames author
# List of meta tags names that can be retrieved with the -p option.
# Index size increases by the formula in the manual.
# Comment out if no PropertyNames. Case insensitive


IgnoreTotalWordCountWhenRanking yes
# Put yes to ignore the total number of words in the file
# when calculating ranking. Often better with merges and
# small files. Default is no.


#ReplaceRules remove "ghill/"
#ReplaceRules replace "[a-z_0-9]*_m.*\.html" "index.html"
#ReplaceRules replace "/ghill" "moreghillmore"
# ReplaceRules allow you to make changes to file pathnames
# before they're indexed. This directive uses C library
# regex.h regular expressions.
# NOTE: do not use replace <string> "" to remove a string,
# use remove <string> instead - you might get a core dump otherwise.


#MinWordLimit 5
# Set the minimum length of an indexable word. Every shorter word
# will not be indexed.
# Commenting out the line will give the defaults


#MaxWordLimit 5
# Set the maximum length of an indexable word. Every longer word
# will not be indexed.
# Commenting out the line will give the defaults


#WordCharacters abcdefghijklmnopqrstuvwxyz\&#;0123456789.@|,-'"[](~!@$%^{}_+?
# WORDCHARS is a string of characters which SWISH permits to
# be in words. Any strings which do not include these characters
# will not be indexed. You can choose from any character in
# the following string:
#
# abcdefghijklmnopqrstuvwxyz0123456789_\|/-+=?!@$%^'"`~,.[]{}()
#
# Note that if you omit "0123456789&#;" you will not be able to
# index HTML entities. DO NOT use the asterisk (*), lesser than
# and greater than signs (<), (>), or colon (:).
#
# Including any of these four characters may cause funny things to happen.
# NOTE: Do not escape \ nor " and they cannot be the first letter in the string
# Commenting out the line will give the defaults


#BeginCharacters m"
# Of the characters that you decide can go into words, this is
# a list of characters that words can begin with. It should be
# a subset of (or equal to) WordCharacters
# Same rule of syntax as for WordCharacters


#EndCharacters \"\
# Of the characters that you decide can go into words, this is
# a list of characters that words can end with. It should be
# a subset of (or equal to) WordCharacters
# Same rule of syntax as for WordCharacters


#IgnoreLastChar
# Array that contains the chars that, if considered valid in the middle of
# a word, need to be disregarded when at the end. It is important to also
# set the given char's in the ENDCHARS array, otherwise the word will not
# be indexed because considered invalid.
# Commenting out the line will give the defaults
# NOTE: if " is the first char in the string it needs to be escaped with \
# Do not escape otherwise


#IgnoreFirstChar
# Array that contains the chars that, if considered valid in the middle of
# a word, need to be disregarded when at the beginning. This was to solve
# the problem of parenthesis when there is no space between ( and the
# beginning of the word.
# Remember to add the char's to the BEGINCHARS list also.
# Commenting out the line will give the defaults
# NOTE: if " is the first char in the string it needs to be escaped with \
# Do not escape otherwise


IgnoreLimit 50 1000
# This automatically omits words that appear too often in the files
# (these words are called stopwords). Specify a whole percentage
# and a number, such as "80 256". This omits words that occur in
# over 80% of the files and appear in over 256 files. Comment out
# to turn off auto-stopwording.


#IgnoreWords SwishDefault
# The IgnoreWords option allows you to specify words to ignore.
# Comment out for no stopwords; the word "SwishDefault" will
# include a list of default stopwords. Words should be separated by spaces
# and may span multiple directives.
# New in 2.0. File option reads stopwords from an external file
# eg:
# IgnoreWords File:filename


IndexComments 0
# This option lets the user decide whether to index the comments in the files.
# Default is 1. Set to 0 if comment indexing is not required.


#TranslateCharacters string1 string2
# This option causes the characters in string1 to be indexed
# as the characters in string2.
# This is done after HTML entities are converted.
# This option is useful in languages like Spanish, French, ...
# eg:
# TranslateCharacters _á -a
# This will index a_b as a-b and ámo as amo


##################################
# DIRECTIVES for FILESYSTEMS ONLY
# Comment out if using HTTP
###################################


#IndexOnly .html .q
# Only files with these suffixes will be indexed.


#NoContents .gif .xbm .au .mov .mpg .pdf .ps
# Files with these suffixes will not have their contents indexed -
# only their file names will be indexed.


#FileRules pathname contains .*dir1
#FileRules filename contains # % ~ .bak .orig .old old.
#FileRules title contains construction example pointers
#FileRules directory contains .htaccess
#FileRules filename is index
# Files matching the above criteria will *not* be indexed.
# The pattern matching uses the C library regex.h


################################
# DIRECTIVES for HTTP METHOD ONLY
# Comment out if using FILESYSTEM
##################################


MaxDepth 5
#(default 5) This defines how many links the spider should
#follow before stopping. A value of 0 configures the spider to
#traverse all links


Delay 1
#(default 60) The number of seconds to wait between issuing
#requests to a server.


#TmpDir /home/ghill/swishRon/
#(default /var/tmp) The location of a writeable temp directory
#on your system. The HTTP access method tells the Perl helper to place
#its files there.


SpiderDirectory /usr/local/bin/
#(default ./) The location of the Perl helper
#script. Remember, if you use a relative directory, it is relative to
#your directory when you run SWISH-E, not to the directory that SWISH-E
#is in.


#EquivalentServer http://library.berkeley.edu http://www.lib.berkeley.edu
#EquivalentServer http://sunsite.berkeley.edu:2000 http://sunsite.berkeley.edu
#(default nothing) This allows you to deal with
#servers that respond to multiple DNS names. Each line should have
#a list of all the method/names that should be considered equivalent.
#If you have multiple directives, each one defines its own set of
#equivalent servers.


#FilterDir <path-to-filterprog/>
#FileFilter <file-ext> <filter-program>


FilterDir /usr/local/htdocs/swish-filter/
FileFilter .pdf pdf-filter.sh
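
As a sanity check on the two directives above, a small script could confirm that each configured filter resolves to an executable file. This is a rough sketch rather than anything swish-e provides: the function name check_filters is made up for illustration, and it assumes swish-e joins the FilterDir value and the filter program name directly, as the trailing slash in the sample suggests.

```shell
#!/bin/sh
# check_filters CONF: for each FileFilter line in a swish-e config file,
# report whether FilterDir + program name is an executable file.
# (Hypothetical helper; assumes FilterDir ends in "/".)
check_filters() {
    conf="$1"
    # Take the last uncommented FilterDir value in the file.
    dir="$(awk '$1 == "FilterDir" { d = $2 } END { print d }' "$conf")"
    # Walk the uncommented FileFilter lines: extension, then program.
    awk '$1 == "FileFilter" { print $2, $3 }' "$conf" |
    while read -r ext prog; do
        if [ -x "$dir$prog" ]; then
            echo "ok: $ext -> $dir$prog"
        else
            echo "missing or not executable: $ext -> $dir$prog"
        fi
    done
}

# Only run the check when a config file is given on the command line.
if [ "$#" -gt 0 ]; then
    check_filters "$1"
fi
```

Saved under any name and run with /usr/local/etc/ahtestconf as its argument, it would print one line per FileFilter entry.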





Oxfam GB is a member of Oxfam International, a company limited by guarantee and registered in England No. 612172.
Registered office: 274 Banbury Road, Oxford OX2 7DZ.
Registered charity No. 202918.

Visit the web site at http://www.oxfam.org.uk
Received on Wed Feb 6 21:34:53 2002