Hi,
The command to start the indexing is:
swish-e -c swish-e.conf -S prog
The config file is:
@servers = ({
    skip => 0, # skip spidering server flag
    base_url => 'http://aaa.company.com/intranet/index.html',
    credentials => 'not-a:chance',
    agent => 'swish-e spider http://swish-e.org/',
    email => 'Bogus@company.com',
    # limit to only .html files
    delay_sec => 0, # Delay in seconds between requests
    max_time => 60, # Max time to spider in minutes
    max_files => 10000, # Max unique URLs to spider
    max_indexed => 10000, # Max number of files to send to swish for indexing
    max_depth => 3, # Max number of link levels to spider
    keep_alive => 1, # Enable keep-alive requests
    use_cookies => 1, # True will keep a cookie jar
    validate_links => 1, # Solution to the single webserver?
});
1;
Note that max_depth is set low for now to avoid waiting too long for the
indexing, but the link across servers is still found within these 3
levels. The link should also be reached well within the max_files and
max_indexed limits.
Cas
On 5/11/06, Peter Karman <peter@peknet.com> wrote:
> Since you didn't post your config files, we have no way of knowing if
> the problem is there.
>
> Chances are good that you need to list all 3 base URLs in your config
> file, since the spider likely sees them as different hosts and doesn't
> follow them. If it did follow, by default, a link to http://google..../
> could prove disastrous. ;)
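
A minimal sketch of what listing each base URL could look like in the
spider config. The hostnames bbb.company.com and ccc.company.com are
hypothetical placeholders (the thread does not name the other two
servers), and sharing one set of options across all three entries is
an assumption made to keep the example short:

```perl
# Sketch only: options shared by every server entry.
# All keys are taken from the config posted above.
my %common = (
    agent          => 'swish-e spider http://swish-e.org/',
    email          => 'Bogus@company.com',
    delay_sec      => 0,      # Delay in seconds between requests
    max_time       => 60,     # Max time to spider in minutes
    max_files      => 10000,  # Max unique URLs to spider
    max_indexed    => 10000,  # Max number of files sent to swish
    max_depth      => 3,      # Max number of link levels to spider
    keep_alive     => 1,
    use_cookies    => 1,
    validate_links => 1,
);

# One hash per server; spider.pl works through the entries in turn.
@servers = (
    { %common, base_url => 'http://aaa.company.com/intranet/index.html' },
    { %common, base_url => 'http://bbb.company.com/' },  # hypothetical host
    { %common, base_url => 'http://ccc.company.com/' },  # hypothetical host
);

1;
```

If the other servers need a login, a credentials entry would have to be
added to the matching hash as well.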
Received on Mon May 15 02:23:59 2006