
Re: Getting the right files indexed the right way

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Jan 29 2004 - 05:40:48 GMT
On Wed, Jan 28, 2004 at 08:07:42PM -0800, Rob de Santos AFANA wrote:
> 
> Yes, but it simplifies things considerably to use it (robots.txt).  In
> this case, I'll just have to let swish-e have a go at the gallery itself
> and see how that goes.  Then, I'll just watch and see what the other
> spiders on the net do. 

Think twice about placing anything in robots.txt that isn't linked from 
anywhere.  I have added fake entries to robots.txt and then seen 404 
requests for those entries.  Not everyone plays by the rules.

Like Dave says, you can disable robots.txt parsing in spider.pl and
then use WWW::RobotRules directly in your config file instead.  See
perldoc WWW::RobotRules.  You could read in your robots.txt file,
modify it as needed, and then parse it with the WWW::RobotRules module.
Then in the spider's test_url() call-back function:

    return unless $rules->allowed( $url );
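For example, something roughly along these lines (untested -- the file
path, the user-agent name, and the idea of filtering out the gallery
entries are just placeholders) could go near the top of your
SwishSpiderConfig.pl:

    use WWW::RobotRules;

    my $rules = WWW::RobotRules->new( 'swish-e spider' );

    # Read the local robots.txt, dropping the entries you don't want
    # the indexer to honor (here, anything mentioning the gallery).
    open my $fh, '<', '/usr/local/apache/htdocs/robots.txt'
        or die "Can't read robots.txt: $!";
    my $robots_txt = join '', grep { !/gallery/i } <$fh>;
    close $fh;

    # parse() takes the URL the rules came from plus the content.
    $rules->parse( 'http://www.example.com/robots.txt', $robots_txt );

and then in the server hash:

    test_url => sub {
        my $uri = shift;
        return unless $rules->allowed( $uri );
        return 1;
    },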

You can also run more than one program at a time, so you could say:

    IndexDir spider.pl ./fetch_images.pl

where spider.pl uses SwishSpiderConfig.pl (by default) and 
fetch_images.pl is a modified DirTree.pl program that returns the names 
of the image files.
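
As a rough idea only (the directory handling, the image extensions, and
the choice to index just the file name as the content are all guesses
at what you'd want), fetch_images.pl could be as small as this:

    #!/usr/local/bin/perl -w
    use strict;
    use File::Find;

    my $dir = shift || die "Usage: $0 directory\n";

    # Walk the tree and emit only the image files, in the "prog"
    # document format swish-e reads on stdin.
    find( sub {
        return unless -f && /\.(?:jpe?g|gif|png)$/i;

        # Index just the file name as the document content.
        my $content = $_;

        print "Path-Name: $File::Find::name\n",
              "Content-Length: ", length( $content ), "\n\n",
              $content;
    }, $dir );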


-- 
Bill Moseley
moseley@hank.org
Received on Wed Jan 28 21:40:50 2004