Thanks,
I think that is precisely what I'm looking for!
Antonio
-----Original Message-----
From: Thomas Dowling [mailto:tdowling@ohiolink.edu]
Sent: Thursday, September 16, 2004 3:37 PM
To: abarrera@Princeton.EDU
Cc: Multiple recipients of list
Subject: Re: [SWISH-E] Indexing Off Site Links
Antonio Barrera wrote:
>I've seen some threads about similar problems to the one I'm facing,
>yet many were older solutions.
>
>My base url is: http://library.princeton.edu . However, there are
>links to other servers which I would want to index, without indexing the
>entire site.
>Prior to indexing, I have some knowledge of the servers/directories I do
>want to search.
>
>For instance: I may want to index,
>http://www.princeton.edu/~rbsc/exhibitions/online.html but not all of
>www.princeton.edu. Or I may want to do
>http://libweb5.princeton.edu/ejournals/by_title_zd.asp but not all of
>libweb5.princeton.edu.
>
Somewhere along the way, I picked up this syntax in the spider.conf file:
=============
my %SecondarySite = (
    base_url  => 'http://foo.ohiolink.edu/documentation/',
    email     => 'tdowling@ohiolink.edu',
    delay_sec => 1,
    test_url  => sub {
        my $uri = shift;

        # Skip requesting files that are probably not text
        return if $uri->path =~ m[\.(?:gif|jpg|jpeg|png|css)$]i;

        # Limit spidering by path
        # We only want the /documentation/ directory
        return unless $uri->path =~ /documentation/;

        return 1;  # otherwise, ok to search
    },
);

@servers = (\%MainSite, \%SecondarySite);
=============
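
Applied to one of the URLs from the question, the same pattern might look roughly like the sketch below. The hash names, the placeholder e-mail address, and the minimal %MainSite entry are assumptions for illustration only and are not taken from a working configuration. The path test is anchored at the start of the path so that only links under /~rbsc/exhibitions/ are followed.

```perl
# Sketch only: hash names and the e-mail address are placeholders (assumptions).

# Assumed main site, spidered normally from its base URL
my %MainSite = (
    base_url  => 'http://library.princeton.edu/',
    email     => 'you@example.edu',   # placeholder contact address
    delay_sec => 1,
);

# Off-site entry: start at one page and stay inside its directory
my %ExhibitionsSite = (
    base_url  => 'http://www.princeton.edu/~rbsc/exhibitions/online.html',
    email     => 'you@example.edu',   # placeholder contact address
    delay_sec => 1,
    test_url  => sub {
        my $uri = shift;

        # Skip requesting files that are probably not text
        return if $uri->path =~ m[\.(?:gif|jpg|jpeg|png|css)$]i;

        # Follow links only under /~rbsc/exhibitions/ (tilde may arrive escaped)
        return unless $uri->path =~ m[^/(?:~|%7E)rbsc/exhibitions/]i;

        return 1;  # otherwise, ok to spider
    },
);

@servers = (\%MainSite, \%ExhibitionsSite);
```

Each additional off-site directory would get its own hash along the same lines, added to the @servers list.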
--
Thomas Dowling
tdowling@ohiolink.edu
Received on Thu Sep 16 12:44:23 2004