Antonio Barrera wrote:
>I've seen some threads about similar problems to the one I'm facing, yet
>many were older solutions.
>
>My base url is: http://library.princeton.edu . However, there are links to
>other servers which I would want to index, without indexing the entire site.
>Before indexing, I already know which servers/directories I want to
>search.
>
>For instance: I may want to index,
>http://www.princeton.edu/~rbsc/exhibitions/online.html but not all of
>www.princeton.edu. Or I may want to do
>http://libweb5.princeton.edu/ejournals/by_title_zd.asp but not all of
>libweb5.princeton.edu.
>
Somewhere along the way, I picked up this syntax for the spider.conf file.
Each server gets its own hash, and its test_url callback limits spidering
by path:
=============
# %MainSite is defined the same way earlier in the file (not shown).
my %SecondarySite = (
    base_url  => 'http://foo.ohiolink.edu/documentation/',
    email     => 'tdowling@ohiolink.edu',
    delay_sec => 1,
    test_url  => sub {
        my $uri = shift;
        # Skip requesting files that are probably not text
        return if $uri->path =~ m[\.(?:gif|jpg|jpeg|png|css)$]i;
        # Limit spidering by path:
        # we only want the /documentation/ directory, so anchor the
        # match at the start of the path
        return unless $uri->path =~ m[^/documentation/];
        return 1; # otherwise, ok to spider
    },
);
@servers = (\%MainSite, \%SecondarySite);
=============
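
Adapted to your URLs, the same pattern might look roughly like the sketch
below. I haven't run it against your servers. The hash names (%Exhibitions,
%EJournals), the email address, and the directory boundaries
(/~rbsc/exhibitions/ and /ejournals/) are my assumptions, so adjust them
to whatever you actually want indexed. It also assumes %MainSite is
already defined for http://library.princeton.edu.
=============
my %Exhibitions = (
    base_url  => 'http://www.princeton.edu/~rbsc/exhibitions/online.html',
    email     => 'you@example.edu',   # placeholder, use your own address
    delay_sec => 1,
    test_url  => sub {
        my $uri = shift;
        # Assumed boundary: stay inside the exhibitions directory
        return unless $uri->path =~ m[^/~rbsc/exhibitions/];
        return 1;
    },
);

my %EJournals = (
    base_url  => 'http://libweb5.princeton.edu/ejournals/by_title_zd.asp',
    email     => 'you@example.edu',   # placeholder, use your own address
    delay_sec => 1,
    test_url  => sub {
        my $uri = shift;
        # Assumed boundary: stay inside the ejournals directory
        return unless $uri->path =~ m[^/ejournals/];
        return 1;
    },
);

@servers = (\%MainSite, \%Exhibitions, \%EJournals);
=============
As far as I know, the spider only follows links on the same host as each
entry's base_url, which is why every extra server needs its own entry.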
--
Thomas Dowling
tdowling@ohiolink.edu
Received on Thu Sep 16 12:40:16 2004