RE: Indexing Off Site Links

From: Antonio Barrera <abarrera(at)not-real.Princeton.EDU>
Date: Thu Sep 16 2004 - 19:43:39 GMT
Thanks,

I think that is precisely what I'm looking for!
 
Antonio

-----Original Message-----
From: Thomas Dowling [mailto:tdowling@ohiolink.edu] 
Sent: Thursday, September 16, 2004 3:37 PM
To: abarrera@Princeton.EDU
Cc: Multiple recipients of list
Subject: Re: [SWISH-E] Indexing Off Site Links

Antonio Barrera wrote:

>I've seen some threads about similar problems to the one I'm facing, 
>yet many were older solutions.
>
>My base url is: http://library.princeton.edu .  However, there are 
>links to other servers which I would want to index, without indexing the
>entire site.
>Prior to indexing I have some knowledge of servers/directories, I do 
>want to search.
>
>For instance:  I may want to index,
>http://www.princeton.edu/~rbsc/exhibitions/online.html but not all of 
>www.princeton.edu.  Or I may want to do 
>http://libweb5.princeton.edu/ejournals/by_title_zd.asp but not all of 
>libweb5.princeton.edu.

Somewhere along the way, I picked up this syntax in the spider.conf file:

=============
my %SecondarySite = (
  base_url      => 'http://foo.ohiolink.edu/documentation/',
  email         => 'tdowling@ohiolink.edu',
  delay_sec     => 1,

  test_url      => sub {
    my $uri = shift;

    # Skip requesting files that are probably not text
    return if $uri->path =~ m[\.(?:gif|jpg|jpeg|png|css)$]i;

    # Limit spidering by path
    # We only want the /documentation/ directory
    return unless $uri->path =~ /documentation/;

    return 1;  # otherwise, ok to search
  },

);

@servers = (\%MainSite, \%SecondarySite);
=============
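
Applied to the Princeton URLs from the original question, the same pattern might look roughly like the sketch below. This is only an illustration, not a tested configuration: the %Exhibitions name, the contents of %MainSite, and the e-mail address are placeholders, and how @servers needs to be declared depends on the rest of your spider.conf.

```perl
# Sketch only. The URLs come from the question above; the hash names,
# the %MainSite contents, and the e-mail address are placeholders.
my %MainSite = (
  base_url  => 'http://library.princeton.edu/',
  email     => 'webmaster@example.edu',   # placeholder address
  delay_sec => 1,
);

my %Exhibitions = (
  # Start at the one off-site page we care about
  base_url  => 'http://www.princeton.edu/~rbsc/exhibitions/online.html',
  email     => 'webmaster@example.edu',   # placeholder address
  delay_sec => 1,

  test_url  => sub {
    my $uri = shift;

    # Stay inside /~rbsc/exhibitions/ rather than all of www.princeton.edu.
    # The match is anchored at the start of the path, and also accepts
    # the tilde written as %7E.
    return unless $uri->path =~ m[^/(?:~|%7E)rbsc/exhibitions/]i;

    return 1;  # otherwise, ok to search
  },
);

@servers = (\%MainSite, \%Exhibitions);
```

Anchoring the pattern with ^ keeps the spider from following links whose path merely contains the directory name somewhere else. A further entry for libweb5.princeton.edu/ejournals/ could be added to @servers in the same way.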


--
Thomas Dowling
tdowling@ohiolink.edu
Received on Thu Sep 16 12:44:23 2004