Re: [Fwd: what does a spider index?]

From: John Almberg <jalmberg(at)not-real.identry.com>
Date: Thu Jan 23 2003 - 21:53:55 GMT
Oh whoops . . . found my error. The max_indexed/max_files limits
stopped the spider before it got to the .htm files. How embarrassing!
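
For anyone hitting the same thing, here is a sketch of the kind of change involved. It is the quoted spider.conf with only the two caps raised. The value 1000 is an arbitrary assumption for illustration, not the number actually used; pick limits that comfortably exceed the number of pages on the site. Presumably the .phtml pages used up the 20-file max_indexed cap before any .htm page was reached.

```perl
# spider.conf (sketch): same settings as the quoted config, with the two caps raised
@servers = (
    {
        base_url    => 'http://www.smithtowngospeltabernacle.org/index23.phtml',
        email       => 'jalmberg@identry.com',

        # accept .phtml, .shtml, .html, and .htm URLs
        test_url    => sub { $_[0]->path =~ /\.(phtml|shtml|html|htm)$/ },

        delay_min   => .0001,     # Delay in minutes between requests
        max_time    => 10,        # Max time to spider in minutes
        max_files   => 1000,      # assumed value for illustration; was 100
        max_indexed => 1000,      # assumed value for illustration; was 20
        keep_alive  => 1,         # enable keep-alive requests
    },
);
```

Note that max_time still applies, so a large site can be cut off by the 10-minute limit even with the file caps raised.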

-- John

John Almberg wrote:

>Am I right in thinking that a spider only finds files that are linked to the base document?
>
>I'm using the two configuration files below to spider a site. The site is a mixture of .phtml (php) files and .htm files. However, I can't seem to get the spider to crawl the .htm files, which are definitely linked to the base document. Actually, *directly* from the base document.
>
>Any ideas?
>
>-- John
>
>#swish.conf
>IndexDir ./spider.pl
>SwishProgParameters spider.conf
>DefaultContents HTML
>IndexContents HTML .htm .html .phtml
>
>------------------
>
>#spider.conf:
>@servers = (
>    {
>        base_url    => 'http://www.smithtowngospeltabernacle.org/index23.phtml',
>        email       => 'jalmberg@identry.com',
>
>        # limit to .phtml, .shtml, .html, and .htm files
>        test_url    => sub { $_[0]->path =~ /\.(phtml|shtml|html|htm)$/ },
>
>        delay_min   => .0001,     # Delay in minutes between requests
>        max_time    => 10,        # Max time to spider in minutes
>        max_files   => 100,       # Max Unique URLs to spider
>        max_indexed => 20,        # Max number of files to send to swish for indexing
>        keep_alive  => 1,         # enable keep-alive requests
>    },
>);    

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~
Identry, LLC
www.identry.com
jalmberg@identry.com
Received on Thu Jan 23 21:54:18 2003