On Wed, Jan 28, 2004 at 08:07:42PM -0800, Rob de Santos AFANA wrote:
> Yes, but it simplifies things considerably to use it (robots.txt). In
> this case, I'll just have to let swish-e have a go at the gallery itself
> and see how that goes. Then, I'll just watch and see what the other
> spiders on the net do.
Think twice about placing anything in robots.txt that isn't linked from
anywhere. I have used fake entries in robots.txt and then seen 404
requests for those very entries. Not everyone plays by the rules.
Like Dave says, you can disable robots.txt parsing in spider.pl, but
then you can use WWW::RobotRules directly in your config file instead.
See perldoc WWW::RobotRules. You could read in your robots.txt file,
modify it as needed, and then parse it with the WWW::RobotRules module.
Then in the spider's test_url() callback function:
return unless $rules->allowed( $url );
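Putting those pieces together, a minimal sketch might look like this in
your spider config. The robots.txt path, the agent name, and the site URL
are all assumptions you would replace with your own:

    # Sketch only -- file path, agent name, and URL are examples.
    use WWW::RobotRules;

    my $rules = WWW::RobotRules->new('MySpider/1.0');  # hypothetical agent name

    # Read a local copy of robots.txt (path is an assumption)
    open my $fh, '<', '/var/www/robots.txt' or die "robots.txt: $!";
    my $robots_txt = do { local $/; <$fh> };
    close $fh;

    # ...edit $robots_txt here as needed...

    # parse() takes the URL the rules apply to, plus the file contents
    $rules->parse('http://www.example.com/robots.txt', $robots_txt);

    sub test_url {
        my $uri = shift;
        return unless $rules->allowed( $uri->as_string );
        return 1;
    }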
You can also run more than one program at a time, so you could say:
IndexDir spider.pl ./fetch_images.pl
where spider.pl uses SwishSpiderConfig.pl (by default) and
fetch_images.pl is a modified DirTree.pl program that returns the names
of the image files.
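For example, a stripped-down fetch_images.pl might walk the gallery
directory and emit only image files in swish-e's "prog" document
protocol (Path-Name / Content-Length headers, blank line, then the
content). The directory, extensions, and the choice to index just the
filename are all assumptions -- adapt to your setup:

    #!/usr/bin/perl -w
    # Sketch of a cut-down DirTree.pl that emits only image file names.
    use strict;
    use File::Find;

    my $dir = shift || '.';

    find( sub {
        return unless -f && /\.(?:jpe?g|gif|png)$/i;
        my $path = $File::Find::name;
        my $doc  = $_;   # index the filename itself as the content
        print "Path-Name: $path\n",
              "Content-Length: ", length($doc), "\n\n",
              $doc;
    }, $dir );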
Received on Wed Jan 28 21:40:50 2004