On Wed, Jun 25, 2003 at 12:53:40PM -0500, Cleveland@mail.winnefox.org wrote:
> > Ah, yes. remove your blank lines. A blank line separates sections
> > based on the user agent.
>
> Hm. Still not working. I looked at cnn's robots.txt and I noticed they
> didn't have multiple directories listed. Just /directory or file.html,
> not /directory/directory/file.html. Is it ok to put sub folders?
Yes, subdirectories are fine. See: http://www.robotstxt.org/wc/norobots.html

> Also, I
> have the spider only looking at www.oshkoshpubliclibrary.org/citydirs.
> Could that be the problem?

No. Try this out, changing $site to the base URL of your own site. This is
based on the example in the WWW::RobotRules man page.

moseley@bumby:~/apache$ cat r.pl
use strict;
use warnings;
use WWW::RobotRules;
use LWP::Simple qw(get);

my $site  = 'http://bumby';    # change this to your own site
my $rules = WWW::RobotRules->new('MOMspider/1.0');

{
    # Fetch robots.txt, show it, and hand it to the parser.
    my $url        = "$site/robots.txt";
    my $robots_txt = get $url;
    die "Could not fetch $url\n" unless defined $robots_txt;
    print "==========\n$robots_txt=========\n";
    $rules->parse($url, $robots_txt);
}

# URLs to check against the parsed rules.
my @tests = (
    "$site/citydirs/1857/1857full.pdf",
    "$site/citydirs/1857/1857fullx.pdf",
);

for (@tests) {
    print $rules->allowed($_) ? "allowed" : "not allowed";
    print " $_\n";
}
And here's the output:
moseley@bumby:~/apache$ perl r.pl
==========
User-agent: *
Disallow: /citydirs/1857/1857full.pdf
=========
not allowed http://bumby/citydirs/1857/1857full.pdf
allowed http://bumby/citydirs/1857/1857fullx.pdf
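
If you want to experiment with rules without touching the live robots.txt,
something like the following might help. It is only a sketch: the
example.com host, the directory-level Disallow line, and the verdict()
helper are made up for illustration, and the robots.txt text is passed
straight to parse() instead of being fetched.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Hypothetical site; nothing is fetched over the network.
my $site  = 'http://example.com';
my $rules = WWW::RobotRules->new('MOMspider/1.0');

# Made-up rules: block one whole subdirectory.
my $robots_txt = <<'EOT';
User-agent: *
Disallow: /citydirs/1857/
EOT

# The first argument tells the parser which host the rules belong to.
$rules->parse("$site/robots.txt", $robots_txt);

# Illustrative helper: turn the allowed() result into a label.
sub verdict {
    my ($url) = @_;
    return $rules->allowed($url) ? 'allowed' : 'not allowed';
}

for my $url (
    "$site/citydirs/1857/1857full.pdf",
    "$site/citydirs/1858/1858full.pdf",
) {
    my $label = verdict($url);
    print "$label $url\n";
}
```

This should report the 1857 file as not allowed and the 1858 file as
allowed, since a Disallow value is matched as a prefix of the URL path.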
--
Bill Moseley
moseley@hank.org
Received on Wed Jun 25 18:15:47 2003