At 10:32 AM 02/22/02 -0800, Darryl Friesen wrote:
>I've run across an interesting problem. I'm using the spider.pl (with "-S
>prog" of course) to index our Intranet, which seems to work fine except
>swish-e happily wanders off and indexes our main library web pages as well.
>Our Intranet runs on the SSL port of the same machine (i.e. Intranet URLs
>are all https://library.usask.ca/ and our public pages are
Argh! A week or so ago (for some reason I can't remember) I changed to
using the URI->host_port call, which breaks that. Sorry.
>Is there a quick and dirty way to stop this? I have a common set of
>callback functions for test_url and filter_content that I use for both the
>Intranet and our main server (and a few others) so I can't just "return 0"
>if the URL does not start with "https".
Why not? You could say something like:
    return 0 if $uri->scheme eq 'https'
        && $uri->canonical->authority eq 'library.usask.ca';
That's not tested, but you get the idea.
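As written, that one-liner rejects https URLs on that host, so it fits the run that indexes the public pages. Since you share the callbacks between servers, a slightly more general sketch is to compare each link against the base URL of the server currently being spidered. This assumes test_url is called with ($uri, $server) and that $server->{base_url} holds the configured start URL; the name same_scheme_and_host is made up for illustration, and none of this has been run against spider.pl itself:

```perl
use strict;
use warnings;
use URI;

# Hypothetical shared test_url callback: accept a link only when both its
# scheme and its authority match the base URL of the server being spidered.
# Assumption: $server->{base_url} is the start URL from the spider config.
sub same_scheme_and_host {
    my ( $uri, $server ) = @_;

    # canonical() lowercases scheme/host and drops default ports (80, 443)
    my $base = URI->new( $server->{base_url} )->canonical;
    my $link = $uri->canonical;

    # Relative or scheme-less URIs have no scheme to compare; reject them
    return 0 unless defined $link->scheme;

    # Check the scheme first, so non-http(s) links never reach authority()
    return 0 unless $link->scheme eq $base->scheme;
    return 0 unless $link->authority eq $base->authority;

    return 1;
}
```

In the config you would then set test_url => \&same_scheme_and_host for each server entry, and the Intranet run (base_url https://library.usask.ca/) would skip the plain-http links on the same host, and vice versa.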
>I thought spider.pl would treat the URLs as being different actually, but it
>looks as if it's comparing host, not scheme/port (although I haven't really
>looked at the code; maybe I should).
Yes, it should. It's currently only looking at the host:port.
I'll get an update out in the next 24 hours, but the test_url function is a
good place to do that kind of check. That's why I put those callback
functions in -- so people could fix my bugs ;)
Thanks for the report!
Received on Fri Feb 22 22:42:34 2002