I think the current method is to configure a robots.txt file in the web
server's root directory (http://www.website.com/robots.txt). Search the web
(or possibly the swish docs) for "robot exclusion" to get started.
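For reference, a minimal robots.txt in the standard robot-exclusion format
might look like this (the paths shown are just made-up examples):

```
# Rules apply to all crawlers
User-agent: *
# Keep these subtrees out of the crawl
Disallow: /private/
Disallow: /cgi-bin/
```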
Hope that helps.
Personally I would like a little more flexibility than that, since
I want to create multiple indexes of the same site, each with a different
set of pages that are off-limits. Even better would be a regex-based way
to specify pages that should be crawled but not indexed (so that deeper
pages can still be discovered). So there would be a set of Crawl-Stopping
regexes and a set of DontIndex regexes; it could then, say, crawl but skip
indexing every page on the site with the word "help" or "index" in the URL.
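The decision logic for such a scheme could be sketched roughly as follows.
(The rule-set names and functions here are hypothetical illustrations of
the proposal, not actual swish-e configuration directives.)

```python
import re

# Hypothetical rule sets mirroring the proposal above.
CRAWL_STOPPING = [re.compile(r"/cgi-bin/")]   # don't follow links under these
DONT_INDEX     = [re.compile(r"help|index")]  # crawl for links, but don't index

def should_crawl(url):
    """Follow links from this page unless a Crawl-Stopping rule matches."""
    return not any(rx.search(url) for rx in CRAWL_STOPPING)

def should_index(url):
    """Add the page to the index unless a DontIndex rule also matches."""
    return should_crawl(url) and not any(rx.search(url) for rx in DONT_INDEX)
```

With these rules, a page like http://www.website.com/help/faq.html would
still be crawled for links to deeper pages but kept out of the index itself.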
Being able to specify multiple starting URLs would be cool too, especially
if they could be stored in a file external to the config file (in addition
to multiple URLs in the config file).
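Loading starting URLs from such an external file could be as simple as this
sketch (the file format assumed here, one URL per line with #-comments, is
my own invention):

```python
def load_start_urls(path):
    """Return the non-empty, non-comment lines of a URL list file."""
    with open(path) as f:
        return [line.strip() for line in f
                if line.strip() and not line.lstrip().startswith("#")]
```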
At 01:49 PM 4/29/99 -0700, firstname.lastname@example.org wrote:
>Any way to restrict URLs when using http method since you can't use
>I'd like to be able to stop indexing of certain areas. My guess is that
>I will have to modify the swishspider perl program. Has anyone else done
>this already?
>Gulfstream Aerospace Corp.
Received on Thu Apr 29 14:05:03 1999