I'm having trouble getting SWISH-e to work with IIS unless Directory
Browsing is turned on, and I don't want to do that.
SWISH-e runs on the same server as IIS. The desired content is in a virtual
folder, /docs. Underneath that folder are several additional folders at the
next level containing pdf & doc files. If I run SWISH-e with Directory
Browsing turned on for the /docs virtual folder, everything indexes as
expected. However, we don't want to allow wide-open access to browsing those
directories. If I add a default document, nobrowse.php, to /docs and to each
folder below it, indexing fails. It gets to the nobrowse.php document and
stops. And of course, if I turn Directory Browsing off, indexing fails.
How can I get SWISH-e to index the files in /docs?
Thanks!
Details:
SWISH-e v2.43
Windows Server 2003 Enterprise x64, IIS 6.0
command line: c:\SWISH-e\swish-e.exe -e -v 3 -c C:\swish-e\swish.conf -S prog 1>C:\swish-e\swish_stdout.txt 2>C:\swish-e\swish_stderr.txt
swish.conf:
IndexDir /perl/bin/perl.exe
IndexFile /inetpub/wwwroot/swish/index.swish-e
SwishProgParameters c:/swish-e/lib/swish-e/spider.pl c:/swish-e/lib/swish-e/SwishSpiderConfig.pl
IndexOnly .html .htm .pdf .doc
SwishSpiderConfig.pl:
@servers = (
{
use_default_config => 1,
email => 'me@mysite.com',
base_url => 'https://www.mysite.com/docs',
test_url => \&test_url,
test_response => \&response_sub,
# delay_sec should be commented out in production
delay_sec => 0,
max_time => 90, # Max time to spider in minutes - changed 19Oct10 lj
max_wait_time => 180, # Max time in seconds for spider to wait for data to be returned - added 19Oct10 lj
max_size => 0, # Override max size of 5mb
keep_alive => 1,
#This is OK if we are indexing our own site
ignore_robots_file => 1,
#Use this one in production
debug => 'skipped,errors',
},
);
sub test_url {
my ( $uri, $server ) = @_;
# return 1; # Ok to index/spider
# return 0; # No, don't index or spider
# make sure that the path is limited to the swish path
#print STDERR "Checking $uri->path\r\n";
return 0 if $uri->path !~ m[^/docs]i;
#return 0 if $uri->path =~ m[^/docs/save]i;
# ignore any of these file types
if ($uri->path =~
/\.(css|gif|jpeg|jpg|png|asp|php|ppt|pptx|mp4|wmv|asx|msi|arf)?$/i ) {
#print STDERR "Skipping $uri->path, this file type is excluded\r\n";
return 0;
}
return 1;
}
# This is used in HEAD request to test the content type ahead of time
sub response_sub {
my ( $uri, $server, $response, $content_ref ) = @_;
my $content_type = $response->content_type;
return 1 if $content_type =~ m!^text/!; # allow all text (assume we don't want to filter)
return 1 if $content_type =~ m[^application/msword]i; # allow word doc files
return 1 if $content_type =~ m[^application/pdf]i; # allow pdf files
return 0;
};
1;
nobrowse.php:
<html>
<head>
<title>404 - NOT FOUND</title>
</head>
<body>
<?php echo '404 - NOT FOUND'; ?>
</body>
</html>
--
Sent by Lyle Jensen
Received on Fri Nov 5 14:50:17 2010