I've run across an interesting problem. I'm using spider.pl (with "-S
prog", of course) to index our Intranet, which seems to work fine except that
swish-e happily wanders off and indexes our main library web pages as well.
Our Intranet runs on the SSL port of the same machine (i.e. Intranet URLs
all start with https://library.usask.ca/ and our public pages are under
http://library.usask.ca).
Is there a quick and dirty way to stop this? I have a common set of
callback functions for test_url and filter_content that I use for both the
Intranet and our main server (and a few others) so I can't just "return 0"
if the URL does not start with "https".
I actually thought spider.pl would treat the two sets of URLs as different,
but it looks as if it's comparing only the host, not the scheme/port
(although I haven't really looked at the code; maybe I should).
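To make the comparison I expected concrete, here is a minimal sketch in Python. It is not spider.pl's code, only an illustration of the logic; the function names origin and same_origin, the default-port table, and the /staff/ path are made up for the example.

```python
from urllib.parse import urlsplit

# Default ports, used when a URL does not name one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}

def origin(url):
    """Return (scheme, host, port) for a URL, filling in the default port."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)

def same_origin(url, base_url):
    """True only if scheme, host and port all match the base URL."""
    return origin(url) == origin(base_url)

base = "https://library.usask.ca/"
print(same_origin("https://library.usask.ca/staff/", base))  # True
print(same_origin("http://library.usask.ca/", base))         # False
```

A shared test_url callback could presumably apply the same idea by comparing each URL against the base URL of whichever server is being spidered, instead of hard-coding "https", assuming the callback has access to that base URL.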
I'd appreciate any help or suggestions.
- Darryl
----------------------------------------------------------------------
Darryl Friesen, B.Sc., Programmer/Analyst Darryl.Friesen@usask.ca
Education & Research Technology Services, http://gollum.usask.ca/
Information Technology Services Division,
University of Saskatchewan
----------------------------------------------------------------------
"Go not to the Elves for counsel, for they will say both no and yes"
Received on Fri Feb 22 18:34:00 2002