[swish-e] spider.pl, how not to spider file access protocol?

From: Erik van Duren <erik.van.duren(at)not-real.snow.nl>
Date: Fri Nov 02 2007 - 10:35:11 GMT

Hi there,

I have been looking all over for an answer to the following question, but can't
seem to find it.

We are using the spider.pl provided with SWISH-E 2.4.3 to spider our intranet.
Many of the documents contain links that point to local files (file:// URLs),
so the process spends a lot of time on local filesystems, which we don't want.
Only data that can be found on our intranet server ends up in the index anyway,
so following those links adds nothing. We therefore want to stop spider.pl from
spidering local filesystems altogether.

The only option I have found that looks like it could do the job is
"test_url". To check whether it works for this purpose, I tried the following
setting:
    test_url    => sub { $_[0]->canonical !~ /file:\/\//i },
It had no effect, however:

==========

Indexing Data Source: "External-Program"
Indexing "/web/bin/swish-e-2.4.3/prog-bin/spider.pl"
External Program found: /web/bin/swish-e-2.4.3/prog-bin/spider.pl
/web/bin/swish-e-2.4.3/prog-bin/spider.pl: Reading parameters from
'/web/bin/swish-e-2.4.3/prog-bin/SwishSpiderConfig.pl'

 -- Starting to spider: http://acme.com/personal.html --
?Testing 'test_url' user supplied function #1 'http://acme.com/personal.html'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_response' user supplied function #1
'http://acme.com/personal.html'
+Passed all 1 tests for 'test_response' user supplied function
>> +Fetched 0 Cnt: 1 GET  http://acme.com/personal.html  200 OK text/html 210
parent: depth:0
?Testing 'test_url' user supplied function #1
'http://acme.com/personal/foobar/'
+Passed all 1 tests for 'test_url' user supplied function
! Found 1 links in http://acme.com/personal.html

?Testing 'filter_content' user supplied function #1
'http://acme.com/personal.html'
+Passed all 1 tests for 'filter_content' user supplied function

Summary for: http://acme.com/personal.html
     Connection: Close:   1  (1.0/sec)
Connection: Keep-Alive:   1  (1.0/sec)
        Off-site links:   1  (1.0/sec)
           Total Bytes: 210  (210.0/sec)
            Total Docs:   1  (1.0/sec)
           Unique URLs:   2  (2.0/sec)
             text/html:   1  (1.0/sec)

Bad Links:

On page: http://acme.com/personal.html
 file:///acme/com/personal//foobar  404 File `/acme/com/personal//foobar' does
not exist
 http://acme.com/personal/foobar/  404 Not found

===============
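
For reference, a minimal sketch of how such a test sits in a server entry in
SwishSpiderConfig.pl. This is not our exact file: the email value is a
placeholder, and the callback checks the URI scheme instead of using the regex
above, on the assumption that the first argument is a URI object (as the
->canonical call implies). Note that in the output above 'test_url' is only
reported for the two http URLs, so the callback may never be handed the file:
link in the first place.

```perl
# One entry of the @servers list in SwishSpiderConfig.pl (sketch only).
{
    base_url => 'http://acme.com/personal.html',   # start URL from the log above
    email    => 'admin@example.com',               # placeholder address

    # Return true to keep a URL, false to skip it. The first argument is
    # assumed to be a URI object, so the scheme can be checked directly.
    test_url => sub {
        my ($uri, $server) = @_;
        my $scheme = $uri->scheme || '';
        return $scheme ne 'file';
    },
},
```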

Can someone provide me with a way to stop spider.pl from spidering local files?

Thank you very much!

Erik.