Re: swish-e spider does not go beyond index.html

From: Giulia Hill <ghill(at)not-real.library.berkeley.edu>
Date: Fri Oct 16 1998 - 16:06:45 GMT
If the links in your index.html do not point to the same server that you
specified in IndexDir, that is, if they are not under
http://47.48.16.63, they will be disregarded. To index pages from other
servers as well, you will have to use the EquivalentServer directive.
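
To make the same-server rule concrete, here is a rough Python sketch of the kind of check described above. It is only an illustration under assumptions, not the actual swish-e spider code: the function names, the `equivalent_servers` parameter, and the example.org host are all hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def server_key(url):
    """Return (scheme, host, port) for a URL, filling in the default port."""
    parts = urlsplit(url)
    default_port = {"http": 80, "https": 443}.get(parts.scheme)
    return (parts.scheme, (parts.hostname or "").lower(), parts.port or default_port)


def links_to_follow(start_url, hrefs, equivalent_servers=()):
    """Keep only links on the start server or on a server declared equivalent.

    Simplified model: every entry in equivalent_servers (hypothetical
    parameter) is treated as equivalent to the server of start_url.
    """
    allowed = {server_key(start_url)}
    allowed.update(server_key(u) for u in equivalent_servers)
    kept = []
    for href in hrefs:
        absolute = urljoin(start_url, href)  # resolve relative links first
        if server_key(absolute) in allowed:
            kept.append(absolute)
    return kept


start = "http://47.48.16.63/beta/index.html"
hrefs = ["page2.html", "http://www.example.org/a.html"]

# Without any equivalent servers, the off-server link is dropped.
same_server_only = links_to_follow(start, hrefs)

# Declaring the other host as equivalent lets its link through.
with_equivalent = links_to_follow(start, hrefs, ["http://www.example.org"])
```

In this model, a page whose links all point to another host yields nothing to follow, which would look like the spider stopping at index.html.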

Giulia

On Fri, 16 Oct 1998, Christian Stalberg wrote:

> Thank you for everyone's advice re. the 'Bad directives' error. We
> recompiled swish-e using METHOD=HTTP and it will run; however, it does not
> index anything more than just the starting page index.html , i.e. the spider
> is not working. Thank you for your insight. 
> 
> Here is a copy of our config file:
> 
> 
> # DIRECTIVES COMMON to  HTTP and FILESYSTEM METHODS
> ###################################################
> IndexDir http://47.48.16.63/beta/index.html
>
> # For the FileSystem Method:
> # This is a space-separated list of files and
> # directories you want indexed. You can specify
> # more than one of these directives.
> # For the HTTP Method:
> # Use the URL's from which you want the spidering
> # to begin.
> 
> IndexFile /home/u6/genesis/httpd/cgi-bin/swish-e.dir/newindex2
> # This is what the generated index file will be.
> 
> IndexName "Improvement index"
> IndexDescription "This is an index to test bug fixes in swish." 
> IndexPointer "http://sunsite/~ghill/swish/index.html"
> IndexAdmin "Giulia Hill, (ghill@library.berkeley.edu)"
> # Extra information you can include in the index file.
> 
> MetaNames first author keywords description abstract
> # List of all the meta names used in the file to index, must be on one line.
> # If there are no metanames, DO NOT delete the line.
> 
> IndexReport 3
> # This is how detailed you want reporting. You can specify numbers
> # 0 to 3 - 0 is totally silent, 3 is the most verbose.
> 
> FollowSymLinks yes
> # Put "yes" to follow symbolic links in indexing, else "no".
> 
> ReplaceRules remove "ghill/"
> ReplaceRules replace "[a-z_0-9]*_m.*\.html" "index.html"
> #ReplaceRules replace "/ghill" "moreghillmore"
> # ReplaceRules allow you to make changes to file pathnames
> # before they're indexed. This directive uses C library
> # regex.h regular expressions.
> # NOTE: do not use replace <string> "" to remove a string,
> # use remove <string> instead - you might get a core dump otherwise.
> 
> #MinWordLimit 5
> # Set the minimum length of an indexable word. Every shorter word
> # will not be indexed.
> # Commenting out the line will give the defaults
> 
> #MaxWordLimit 5
> # Set the maximum length of an indexable word. Every longer word
> # will not be indexed.
> # Commenting out the line will give the defaults
> 
> #WordCharacters abcdefghijklmnopqrstuvwxyz\&#;0123456789.@|,-'"[](~!@$%^{}_+?
> # WORDCHARS is a string of characters which SWISH permits to
> # be in words. Any strings which do not include these characters
> # will not be indexed. You can choose from any character in
> # the following string:
> #
> # abcdefghijklmnopqrstuvwxyz0123456789_\|/-+=?!@$%^'"`~,.[]{}()
> #
> # Note that if you omit "0123456789&#;" you will not be able to
> # index HTML entities. DO NOT use the asterisk (*), less than
> # and greater than signs (<), (>), or colon (:).
> #
> # Including any of these four characters may cause funny things to happen.
> # NOTE: Do not escape \ nor " and they cannot be the first letter in the string
> # Commenting out the line will give the defaults
> 
> #BeginCharacters m"
> # Of the characters that you decide can go into words, this is
> # a list of characters that words can begin with. It should be
> # a subset of (or equal to) WordCharacters
> # Same rule of syntax as for WordCharacters
> 
> #EndCharacters \"\
> # Of the characters that you decide can go into words, this is
> # a list of characters that words can end with. It should be
> # a subset of (or equal to) WordCharacters
> # Same rule of syntax as for WordCharacters
> 
> IgnoreLastChar 
> # Array of characters that, although valid in the middle of a word,
> # need to be disregarded when they occur at the end. It is important to also
> # include these characters in the ENDCHARS array; otherwise the word will not
> # be indexed because it is considered invalid.
> # Commenting out the line will give the defaults
> # NOTE: if " is the first char in the string it needs to be escaped with \
> # Do not escape otherwise
> 
> IgnoreFirstChar 
> # Array of characters that, although valid in the middle of a word,
> # need to be disregarded when they occur at the beginning. This solves
> # the problem of parentheses when there is no space between ( and the
> # beginning of the word.
> # Remember to add the characters to the BEGINCHARS list also.
> # Commenting out the line will give the defaults
> # NOTE: if " is the first char in the string it needs to be escaped with \
> # Do not escape otherwise
> 
> IgnoreLimit 50 1000
> # This automatically omits words that appear too often in the files
> # (these words are called stopwords). Specify a whole percentage
> # and a number, such as "80 256". This omits words that occur in
> # over 80% of the files and appear in over 256 files. Comment out
> # to turn off auto-stopwording.
> 
> #IgnoreWords SwishDefault
> # The IgnoreWords option allows you to specify words to ignore.
> # Comment out for no stopwords; the word "SwishDefault" will
> # include a list of default stopwords. Words should be separated by spaces
> # and may span multiple directives.
> 
> IndexComments 0
> # This option lets the user decide whether to index the comments in the files.
> # The default is 1. Set to 0 if comment indexing is not required.
> 
> ##################################
> # DIRECTIVES for FILESYSTEMS ONLY 
> # Comment out if using HTTP
> ###################################
> 
> # IndexOnly .html .htm .txt
> # Only files with these suffixes will be indexed.
> 
> # NoContents .gif .xbm .au .mov .mpg .pdf .ps .jpg .pl
> # Files with these suffixes will not have their contents indexed -
> # only their file names will be indexed.
> 
> #FileRules pathname contains .*dir1
> #FileRules filename contains # % ~ .bak .orig .old old.
> #FileRules title contains construction example pointers
> #FileRules directory contains .htaccess
> #FileRules filename is index
> # Files matching the above criteria will *not* be indexed.
> # The pattern matching uses the C library regex.h 
> 
> ################################
> # DIRECTIVES for HTTP METHOD ONLY
> # Comment out if using FILESYSTEM
> ##################################
> 
> MaxDepth 5
> #(default 5)  This defines how many links the spider should
> #follow before stopping.  A value of 0 configures the spider to
> #traverse all links
> 
> Delay 60
> #(default 60)  The number of seconds to wait between issuing
> #requests to a server.
> 
> TmpDir /home/u9/directory/
> #(default /var/tmp)  The location of a writeable temp directory
> #on your system.  The HTTP access method tells the Perl helper to place
> #its files there.
> 
> SpiderDirectory /home/u6/genesis/httpd/cgi-bin/swish-e.dir/src/
> #(default ./)  The location of the Perl helper
> #script.  Remember, if you use a relative directory, it is relative to
> #your directory when you run SWISH-E, not to the directory that SWISH-E
> #is in.
> 
> EquivalentServer 
> #(default nothing)  This allows you to deal with
> #servers that respond to multiple DNS names.  Each line should have
> #a list of all the method/names that should be considered equivalent.
> #If you have multiple directives, each one defines its own set of equivalent
> #servers.
> 
> 
Received on Fri Oct 16 09:14:02 1998