Am I right in thinking that a spider only finds files that can be reached by following links from the base document?
I'm using the two configuration files below to spider a site. The site is a mixture of .phtml (PHP) files and .htm files. However, I can't seem to get the spider to crawl the .htm files, even though they are definitely linked from the base document. In fact, they are linked *directly* from the base document.
Any ideas?
-- John
#swish.conf
IndexDir ./spider.pl
SwishProgParameters spider.conf
DefaultContents HTML
IndexContents HTML .htm .html .phtml
------------------
#spider.conf:
@servers = (
    {
        base_url    => 'http://www.smithtowngospeltabernacle.org/index23.phtml',
        email       => 'jalmberg@identry.com',
        # limit to .phtml, .shtml, .html and .htm files
        test_url    => sub { $_[0]->path =~ /\.(phtml|shtml|html|htm)$/ },
        delay_min   => .0001,  # Delay in minutes between requests
        max_time    => 10,     # Max time to spider in minutes
        max_files   => 100,    # Max unique URLs to spider
        max_indexed => 20,     # Max number of files to send to swish for indexing
        keep_alive  => 1,      # Enable keep-alive requests
    },
);
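
One way to narrow this down would be to log every decision the test_url callback makes. What follows is only a minimal sketch, not something taken from the spider.pl documentation. It assumes the callback receives a URI object as its first argument (the original callback already relies on that), and it writes to STDERR so the messages stay out of the document stream on STDOUT. It would replace only the test_url line in spider.conf:

```perl
# Hypothetical debugging variant of the test_url entry (same pattern as above)
test_url => sub {
    my $uri = shift;    # URI object for the link being considered
    my $ok  = $uri->path =~ /\.(phtml|shtml|html|htm)$/;
    # Report each decision on STDERR, e.g. "accept: http://..." or "skip: http://..."
    warn( ( $ok ? 'accept' : 'skip' ) . ": $uri\n" );
    return $ok;         # a true value tells the spider to fetch this URL
},
```

If the .htm links show up as "accept" here but still are not indexed, the pattern is not what excludes them, and limits such as max_indexed or max_files would be the next thing to check. Note also that the pattern is anchored to the end of the path, so a link whose path ends in "/" (no file name) would be logged as "skip".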
--
~~~~~~~~~~~~~~~~~~~~~~~~~~
Identry, LLC
www.identry.com
jalmberg@identry.com
Received on Thu Jan 23 20:24:54 2003