At 01:11 PM 10/29/02 -0800, Shen Yang wrote:
>Now that I am ready to index my site, a question occurred to me: how does
>spider.pl know when to stop crawling? Does the spider only index pages
>of a given server and/or domain or does the spider.pl follow all the
>links that it encounters, including links to sites in other servers
>and/or domains?
It spiders one server, which is defined by a host name and a port number.
>For instance, if my site in the domain ny.frb.org has
>links to pages on www.firstgov.org, does that mean that the spider.pl
>will also index pages in first.gov domain?
No.
The configuration file is a Perl array, with each element of the array
being a separate server config (represented by a Perl hash). This allows
you to index multiple servers. See:
http://swish-e.org/dev/docs/spider.html#CONFIGURATION_FILE
For a given server, you can use the "same_hosts" setting to say that
www.frb.org and frb.org are the same server.
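As a rough sketch of what such a config file might look like (the key
names base_url, email and same_hosts are as I recall them from the docs
linked above, so check them there; the email address is a placeholder and
the second server entry is only there for illustration):

```perl
# Sketch of a spider.pl config file. Verify key names against the docs.
@servers = (
    {
        # Server 1: the crawl starts here and stays on this host and port.
        base_url   => 'http://www.frb.org/',
        # Host names to treat as the same server as the base_url host.
        same_hosts => [ qw/frb.org/ ],
        email      => 'webmaster@example.org',   # placeholder address
    },
    {
        # Server 2: a separate array element, spidered independently.
        base_url   => 'http://ny.frb.org/',
        email      => 'webmaster@example.org',   # placeholder address
    },
);

1;
```

Each hash is spidered on its own, so links from one server to a host
that is not its base_url host or one of its same_hosts are not followed.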
There's currently no way to say "index www.frb.org, but also follow links
from www.frb.org to a list of other servers."
--
Bill Moseley
mailto:moseley@hank.org
Received on Tue Oct 29 21:48:01 2002