
Re: swish-e only spiders the server it started on

From: Guy de Vanny <guyd(at)not-real.tpg.com.au>
Date: Fri May 19 2006 - 14:47:59 GMT
Hi Cas,
I can't comment on the authentication problem, but I can comment on the 
spidering.

I have recently faced the same problem with multiple interlinked 
intranet sites, such as your aaa.com, bbb.com and ccc.com, where I had 
an entry page on aaa.com, and links between the various servers. I 
needed to start on aaa.com, and follow links between the servers, but 
not to other servers. In my case, and I think in yours, putting multiple 
servers in base_url did not work, as bbb.com etc did not have a real 
entry point - they are really just extra parts of the aaa.com intranet. 
The same_hosts option did not address the issue either, as the servers 
had different content.

I solved my problem by changing the spider.pl program. I am using 
swish-e 2.4.3. I added a configuration option "follow_url" to the spider 
config file, and the necessary code to spider.pl to handle it.  The 
config file then had in it:
    ...
    base_url => 'http://aaa.com/index.html',
    follow_url => ['http://bbb.com', 'http://ccc.com'],
     ...  etc ...

The spidering then works, following links between the various servers, 
backwards and forwards as it finds them, but not to outside servers. 
Duplicate links are identified as expected. Searching works as normal, 
returning links to any of the servers from the index.

Bill Moseley touched on this topic in a post 
http://swish-e.org/archive/2004-04/7286.html on 08 April 2004, where he 
suggests how to skip the test that allows processing only if the link 
being tested is on the base_url server.  My code keeps the test, but 
extends it, allowing processing to continue if the link is on the 
current base_url server or on any of the servers in the follow_url 
option.
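
Roughly, the change amounts to one extra host check. The sketch below is 
simplified (the helper name and the way the per-server config hash is 
passed around are mine, not spider.pl's exact structure):

    use URI;

    # Hypothetical helper: may this link be followed?
    # $uri is a URI object for the candidate link; $server is the
    # per-server config hash, holding base_url and follow_url.
    sub allowed_host {
        my ( $uri, $server ) = @_;

        # Original test: only links on the base_url host are processed.
        return 1 if $uri->host_port eq URI->new( $server->{base_url} )->host_port;

        # Extension: also allow any host named in follow_url.
        for my $other ( @{ $server->{follow_url} || [] } ) {
            return 1 if $uri->host_port eq URI->new( $other )->host_port;
        }

        return 0;   # everything else is still skipped
    }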

Following Bill's suggestion for spider.pl and using max_depth may be 
enough for you. My code is available if you want it.

Guy



Bill Moseley wrote:
> On Tue, May 16, 2006 at 09:49:28AM +0200, Cas Tuyn wrote:
>   
>>>  base_url => [qw! http://aaa.company.com/intranet/index.html
>>> http://bbb.company.com/ http://ccc.company.com/ !],
>>>
>>> And see what happens tonight.
>>>       
>> but ran into authorization problems on the 2nd and 3rd server,
>> although all three servers are single sign-on. This is what the IT
>> admin replied:
>>
>> 2 Things seem to be of importance here:
>> - We have 3 different servers with different content in the same path
>> (e.g. /index.html on aaa.company.com, bbb.company.org, ccc.company.com)
>> - The authentication used is the same on all 3 systems but seems to be
>> used only for the 1st URL in the list. The 2nd and 3rd hosts return "401
>> Unauthorized"
>>     
>
> This is all the spider does:
>
>         my @urls = ref $s->{base_url} eq 'ARRAY' ? @{$s->{base_url}} :( $s->{base_url});
>         for my $url ( @urls ) {
>
>             # purge config options -- used when base_url is an array
>             $valid_config_options{$_} ||  delete $s->{$_} for keys %$s;
>
>             $s->{base_url} = $url;
>             process_server( $s );
>         }
>     }
>
> So it's the same config for each one.  Maybe the auth is reset
> somehow during the run.
>
>   
>> I just reread the whole documentation, but could not find anything
>> about authentication on multiple servers. Who has a similar setup and a
>> solution?
>>     
>
> Seems like it would not be very hard for you to debug.  Set up a few
> test servers with auth (just create three domains in your hosts file
> all pointing to the same local web server) and have the spider print
> out the request and response headers -- or even just throw in a few
> print statements into the script to print out when auth is being set.
>
>
>   
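
A quick way to follow up on Bill's header-printing idea, without 
touching spider.pl at all, is a few lines of standalone LWP (the same 
module family the spider uses). The host names, realm and credentials 
below are placeholders -- substitute your own and watch which requests 
come back 401:

    use strict;
    use warnings;
    use LWP::UserAgent;

    # Placeholder URLs -- the three intranet entry points.
    my @urls = qw(
        http://aaa.company.com/intranet/index.html
        http://bbb.company.com/
        http://ccc.company.com/
    );

    my $ua = LWP::UserAgent->new;

    # credentials() takes "host:port", realm, user, password.  The realm
    # string must match what each server sends in WWW-Authenticate.
    $ua->credentials( "$_:80", 'Intranet', 'user', 'secret' )
        for qw( aaa.company.com bbb.company.com ccc.company.com );

    for my $url ( @urls ) {
        my $response = $ua->get( $url );
        print "--- request ---\n",  $response->request->as_string, "\n";
        print "--- response ---\n", $response->status_line, "\n",
              $response->headers->as_string, "\n";
    }
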
Received on Fri May 19 07:48:11 2006