[Fwd: what does a spider index?]

From: John Almberg <jalmberg(at)not-real.identry.com>
Date: Thu Jan 23 2003 - 20:24:40 GMT

Am I right in thinking that a spider only finds files that are linked from the base document?

I'm using the two configuration files below to spider a site. The site is a mixture of .phtml (PHP) files and .htm files. However, I can't seem to get the spider to crawl the .htm files, which are definitely linked from the base document, and *directly* at that.

Any ideas?

-- John

#swish.conf
IndexDir ./spider.pl
SwishProgParameters spider.conf
DefaultContents HTML
IndexContents HTML .htm .html .phtml

------------------

#spider.conf:
@servers = (
    {
        base_url    => 'http://www.smithtowngospeltabernacle.org/index23.phtml',
        email       => 'jalmberg@identry.com',

        # only follow URLs whose path ends in .phtml, .shtml, .html or .htm
        test_url    => sub { $_[0]->path =~ /\.(phtml|shtml|html|htm)$/ },

        delay_min   => 0.0001,    # Delay in minutes between requests
        max_time    => 10,        # Max time to spider in minutes
        max_files   => 100,       # Max unique URLs to spider
        max_indexed => 20,        # Max number of files to send to swish for indexing
        keep_alive  => 1,         # Enable keep-alive requests
    },
);    
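
To rule out the test_url pattern itself, here is a minimal standalone sketch that applies the same regular expression to a few paths. The helper name wanted_path and the sample paths are made up for illustration; they are not part of spider.pl or of the site being spidered.

```perl
use strict;
use warnings;

# Same pattern as the test_url callback in spider.conf.
# wanted_path() is a hypothetical helper, not a spider.pl function.
sub wanted_path {
    my ($path) = @_;
    return $path =~ /\.(phtml|shtml|html|htm)$/ ? 1 : 0;
}

# Hypothetical paths, for illustration only.
for my $path ('/index23.phtml', '/about.htm', '/news.html', '/logo.gif', '/docs/') {
    printf "%-16s %s\n", $path, wanted_path($path) ? 'follow' : 'skip';
}
```

If the .htm paths come out as 'follow' here, the pattern is probably not what keeps them out, and the limits in the config (max_files, max_indexed, max_time) would be the next thing to check.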

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~
Identry, LLC
www.identry.com
jalmberg@identry.com
Received on Thu Jan 23 20:24:54 2003