Re: no files being indexed using http,,,,,

From: <Kevin.Fay(at)not-real.CommerceQuest.com>
Date: Thu Apr 26 2001 - 14:45:57 GMT

David,

I would think the indexer would be more "error friendly" in diagnosing 
this issue. I'll get perl on my web server and see what happens.

Thanks for the help!

-Kevin






David Wood <dwood@inter.nl.net>
Sent by: swish-e@sunsite.berkeley.edu
04/26/01 10:38 AM
Please respond to dwood

 
        To:     Multiple recipients of list <swish-e@sunsite.berkeley.edu>
        cc: 
        Subject:        [SWISH-E] Re: no files being indexed using http,,,,,


Hi Kevin,

Yes, you do need Perl.

How are you running the indexer?  I would think you should be getting lots
of error messages that should be helping you to diagnose this.

cheers,

David


At 16:24 26-04-01, Kevin.Fay@CommerceQuest.com wrote:

>David,
>
>I made the change and received the same results: "no files indexed". Any 
>ideas? Do I need perl on my web server to do a http index?
>
>Thanks,
>
>Kevin
>
>
>
>David Wood <dwood@inter.nl.net>
>
>04/26/01 10:15 AM
>
>         To:        Kevin.Fay@CommerceQuest.com, Multiple recipients of list <swish-e@sunsite.berkeley.edu>
>         cc:
>         Subject:        Re: [SWISH-E] no files being indexed using http,,,,,
>
>
>Hi Kevin,
>
>Neither TmpDir nor SpiderDirectory should be a URL I think.  They should be
>local directory names like C:/jakarta-tomcat-3.2...
>
>cheers,
>
>David
>
>
>At 15:59 26-04-01, Kevin.Fay@CommerceQuest.com wrote:
>
> >First of all, thanks to everyone for their help. The support for this
> >product has been tremendous. This tells me a lot about SWISH-E.
> >
> >Now, I built my http index successfully. Only problem is that the output
> >said "No files indexed". I have my http server running (tomcat).
> >I'm attempting to index all the files under my root directory:
> >"http://localhost:8080/".
> >
> >Any ideas on why I couldn't index any files? I heard that if you're trying
> >an http index, you need to have a running version of perl. Is this true?
> >
> >Here is my config file I'm working with:
> >
> ># DIRECTIVES COMMON to  HTTP and FILESYSTEM METHODS
> >###################################################
> ># WINDOWS USERS NOTE:
> >#        Specify ALL files and directory paths in
> >#        the config file using the forward slash, as
> >#        in /thisdirectory.
> >#
> >###################################################
> >
> >IndexDir http://localhost:8080/
> ># For the FileSystem Method:
> ># This is a space-separated list of files and
> ># directories you want indexed. You can specify
> ># more than one of these directives.
> >#
> ># For the HTTP Method:
> ># Use the URLs from which you want the spidering
> ># to begin.
> ># NOTE: use html files rather than directories
> ># for this method.
> >
> >IndexFile  C:/jakarta-tomcat-3.2/databases/index.swish
> ># This is what the generated index file will be.
> >
> >IndexName "McNichols index"
> >IndexDescription "This is an index to test bug fixes in swish."
> >IndexPointer "http://localhost:8080/index.html"
> >IndexAdmin "Kevin Fay, (kevin.fay@commercequest.com)"
> ># Extra information you can include in the index file.
> >
> >MetaNames first author
> ># List of all the meta names used in the file to index, must be on one line.
> ># If no metanames DO NOT delete the line.
> >
> >IndexReport 3
> ># This is how detailed you want reporting. You can specify numbers
> ># 0 to 3 - 0 is totally silent, 3 is the most verbose.
> >
> >FollowSymLinks yes
> ># Put "yes" to follow symbolic links in indexing, else "no".
> >
> >#UseStemming no
> ># Put yes to apply word stemming algorithm during indexing,
> ># else no. See the manual for info about stemming. Default is
> ># no.
> >
> >#PropertyNames author
> ># List of meta tags names that can be retrieved with the -p option.
> ># Index size increases by the formula in the manual.
> ># Comment out if no PropertyNames. Case insensitive
> >
> >IgnoreTotalWordCountWhenRanking yes
> ># Put yes to ignore the total number of words in the file
> ># when calculating ranking. Often better with merges and
> ># small files. Default is no.
> >
> >#ReplaceRules remove "ghill/"
> >#ReplaceRules replace "[a-z_0-9]*_m.*\.html" "index.html"
> >#ReplaceRules replace "/ghill" "moreghillmore"
> ># ReplaceRules allow you to make changes to file pathnames
> ># before they're indexed. This directive uses C library
> ># regex.h regular expressions.
> ># NOTE: do not use replace <string> "" to remove a string,
> ># use remove <string> instead - you might get a core dump otherwise.
> >
> >#MinWordLimit 5
> ># Set the minimum length of an indexable word. Every shorter word
> ># will not be indexed.
> ># Commenting out the line will give the defaults
> >
> >#MaxWordLimit 10
> ># Set the maximum length of an indexable word. Every longer word
> ># will not be indexed.
> ># Commenting out the line will give the defaults
> >
> >WordCharacters abcdefghijklmnopqrstuvwxyz\&#;0123456789.@|,-'"[](~!@$%^{}_+?
> ># WORDCHARS is a string of characters which SWISH permits to
> ># be in words. Any strings which do not include these characters
> ># will not be indexed. You can choose from any character in
> ># the following string:
> >#
> ># abcdefghijklmnopqrstuvwxyz0123456789_\|/-+=?!@$%^'"`~,.[]{}()
> >#
> ># Note that if you omit "0123456789&#;" you will not be able to
> ># index HTML entities. DO NOT use the asterisk (*), less than
> ># and greater than signs (<), (>), or colon (:).
> >#
> ># Including any of these four characters may cause funny things to happen.
> ># NOTE: Do not escape \ nor " and they cannot be the first letter in the
> ># string
> ># Commenting out the line will give the defaults
> >
> >#BeginCharacters m"
> ># Of the characters that you decide can go into words, this is
> ># a list of characters that words can begin with. It should be
> ># a subset of (or equal to) WordCharacters
> ># Same rule of syntax as for WordCharacters
> >
> >#EndCharacters \"\
> ># Of the characters that you decide can go into words, this is
> ># a list of characters that words can end with. It should be
> ># a subset of (or equal to) WordCharacters
> ># Same rule of syntax as for WordCharacters
> >
> >#IgnoreLastChar
> ># Array that contains the chars that, if considered valid in the middle of
> ># a word, need to be disregarded when at the end. It is important to also
> ># set the given chars in the ENDCHARS array, otherwise the word will not
> ># be indexed because it is considered invalid.
> ># Commenting out the line will give the defaults
> ># NOTE: if " is the first char in the string it needs to be escaped with \
> ># Do not escape otherwise
> >
> >#IgnoreFirstChar
> ># Array that contains the chars that, if considered valid in the middle of
> ># a word, need to be disregarded when at the beginning. This was to solve
> ># the problem of parentheses when there is no space between ( and the
> ># beginning of the word.
> ># Remember to add the chars to the BEGINCHARS list also.
> ># Commenting out the line will give the defaults
> ># NOTE: if " is the first char in the string it needs to be escaped with \
> ># Do not escape otherwise
> >
> >IgnoreLimit 50 1000
> ># This automatically omits words that appear too often in the files
> ># (these words are called stopwords). Specify a whole percentage
> ># and a number, such as "80 256". This omits words that occur in
> ># over 80% of the files and appear in over 256 files. Comment out
> ># to turn off auto-stopwording.
> >
> >#IgnoreWords SwishDefault
> ># The IgnoreWords option allows you to specify words to ignore.
> ># Comment out for no stopwords; the word "SwishDefault" will
> ># include a list of default stopwords. Words should be separated by spaces
> ># and may span multiple directives.
> >
> >IndexComments 0
> ># This option lets the user decide whether to index the comments in the files;
> ># default is 1. Set to 0 if comment indexing is not required.
> >
> >##################################
> ># DIRECTIVES for FILESYSTEMS ONLY
> ># Comment out if using HTTP
> >###################################
> >
> >#IndexOnly .html .q
> ># Only files with these suffixes will be indexed.
> >
> >#NoContents .gif .xbm .au .mov .mpg .pdf .ps
> ># Files with these suffixes will not have their contents indexed -
> ># only their file names will be indexed.
> >
> >#FileRules pathname contains .*dir1
> >#FileRules filename contains # % ~ .bak .orig .old old.
> >#FileRules title contains construction example pointers
> >#FileRules directory contains .htaccess
> >#FileRules filename is index
> ># Files matching the above criteria will *not* be indexed.
> ># The pattern matching uses the C library regex.h
> >
> >################################
> ># DIRECTIVES for HTTP METHOD ONLY
> ># Comment out if using FILESYSTEM
> >##################################
> >
> >MaxDepth 5
> >#(default 5)  This defines how many links the spider should
> >#follow before stopping.  A value of 0 configures the spider to
> >#traverse all links
> >
> >Delay 60
> >#(default 60)  The number of seconds to wait between issuing
> >#requests to a server.
> >
> >TmpDir http://localhost:8080/temp
> >#(default /var/tmp)  The location of a writeable temp directory
> >#on your system.  The HTTP access method tells the Perl helper to place
> >#its files there.
> >
> >SpiderDirectory http://localhost:8080/spider
> >#(default ./)  The location of the Perl helper
> >#script.  Remember, if you use a relative directory, it is relative to
> >#your directory when you run SWISH-E, not to the directory that SWISH-E
> >#is in.
> >
> >#EquivalentServer http://library.berkeley.edu http://www.lib.berkeley.edu
> >#EquivalentServer http://sunsite.berkeley.edu:2000 http://sunsite.berkeley.edu
> >#(default nothing)  This allows you to deal with
> >#servers that respond to multiple DNS names.  Each line should have
> >#a list of all the method/names that should be considered equivalent.
> >#If you have multiple directives, each one defines its own set of equivalent
> >#servers.
>
>
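
For reference, here is a minimal sketch of the directives in question with David's suggestion applied: TmpDir and SpiderDirectory name local directories instead of URLs, and IndexDir starts from an html file, as the NOTE in the config itself recommends. The two C:/swish-e/... paths are hypothetical placeholders, not paths from this thread, and index.html is assumed to exist on the server because the IndexPointer line refers to it.

```
# Start spidering from an html file rather than a directory
# (URL borrowed from the IndexPointer line; assumed to exist).
IndexDir http://localhost:8080/index.html

# Local, writeable temp directory (hypothetical path).
TmpDir C:/swish-e/tmp

# Local directory holding the Perl helper script (hypothetical path).
SpiderDirectory C:/swish-e/src
```

As David notes, Perl is still required. The config comments say the HTTP method hands work to a Perl helper, so Perl would presumably need to be on the machine that runs SWISH-E.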
Received on Thu Apr 26 15:07:46 2001