Ok, so I'm not a fan of the -S http method in swish. For one thing I don't
like the idea of running a separate perl program for every URL fetched. It
would be better to use the libwww C library, or use -S prog where the perl
program is only called once. Someday it might just vanish.
One thing that has always concerned me with -S http is that swish uses a
system() call to call perl (and the swishspider script). In that system
call swish passes the URL to fetch, and that URL can come from fetched web
pages. And that is passing *user data* through the shell. That's a big
security risk.
Swish tries to protect things by single-quoting (double-quoting for Windows)
the URL passed to swishspider, and swishspider escapes any single quote
found in a URL:
$link =~ s/'/%27/g;
That breaks the rule that you should allow known-OK characters instead of
trying to block known-bad ones.
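For illustration, here's a rough sketch in C of the allow-known-OK approach. The function name and the exact allowed set are my own assumptions, not swish code; the idea is just to percent-encode anything outside a conservative allowlist instead of hunting for individual bad characters:

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch, not swish code: keep a conservative set of
 * known-OK URL characters and percent-encode everything else, rather
 * than chasing known-bad characters like the single quote. */
static char *url_encode_allowlist(const char *in)
{
    static const char ok[] = "-_.~:/?&=#%+";      /* besides alphanumerics */
    char *out = malloc(3 * strlen(in) + 1);       /* worst case: all encoded */
    char *p = out;

    for (; *in; in++) {
        unsigned char c = (unsigned char)*in;
        if (isalnum(c) || strchr(ok, c))
            *p++ = (char)c;
        else
            p += sprintf(p, "%%%02X", c);         /* e.g. ' becomes %27 */
    }
    *p = '\0';
    return out;
}
```

With this, a quote or backtick in a fetched link can never reach the shell intact, no matter what quoting the caller uses.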
I can't think of an easy exploit off the top of my head, but I wouldn't be
surprised if someone could find one. That means if you spider pages that are
not under your control you might be giving the web page's author shell access
to your machine.
Also, note that when running under Windows the double-quote is used, so
escaping the single quote is useless there. Windows security (or lack of it)
is another issue.
So, although this breaks my own rule not to mess with the -S http code, I
made some changes. I'll need to know if there are any platforms (excluding
Windows) where this might break things.
1) when running under *Windows*, swish now calls system with:
perl ./swishspider *tmpdir* *url*
The change is that it used to call perl.exe and swishspider.pl.
That's probably still not secure, since *url* is being passed through the
shell.
2) when not running under Windows, swish does a fork/exec/wait and just calls
the swishspider program directly. This means the shebang line is used to
determine the location of the perl program.
3) for some reason it seems like -S http was indexing every file fetched,
regardless of content-type. Now only text/* types are indexed, and
only text/html is spidered for links.
I looked back through CVS and could see how
at one time pages that were not text/* were set as NoContents.
NoContents doesn't index the contents, but indexes just the
path name as the content. That seems less than useful.
Anyway, if you want more control over this stuff you should
be using -S prog with spider.pl (should be using it anyway).
4) swishspider no longer does $link =~ s/'/%27/g; in an attempt to
patch a security hole. It's not needed under unix, since swish now uses
exec() to run the program (skipping the shell).
5) since I was in there, -S http now stores the last-modified dates
fetched from docs. This is too bad, as that missing feature was a convincing
argument for using -S prog and spider.pl instead of -S http.
(ok, spider.pl still doesn't fork for every URL, makes it easy to adjust
which content types are indexed, can use keep-alive to reduce load on the
server and fetch docs faster, doesn't use system() when running under
Windows, doesn't fetch entire docs when the content-type is wrong, and
probably has much better code for dealing with HTTP in general.)
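The fork/exec/wait pattern in (2) can be sketched in C like this. The program path and argument order are illustrative, not swish's actual code; the point is that the URL goes straight into the exec argv, so no shell ever sees it and shell metacharacters in a URL are harmless:

```c
#include <sys/wait.h>
#include <unistd.h>

/* Rough sketch of fork/exec/wait replacing system().  Returns the
 * child's exit status, or -1 on error.  Paths/args are hypothetical. */
static int run_spider(const char *spider, const char *tmpdir, const char *url)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;                       /* fork failed */

    if (pid == 0) {                      /* child: become swishspider */
        execl(spider, spider, tmpdir, url, (char *)NULL);
        _exit(127);                      /* exec failed */
    }

    int status;                          /* parent: wait for the child */
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the kernel hands argv to the child verbatim, a URL like http://host/';rm -rf~ is just an odd string to the spider, not a command.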
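And the content-type test in (3) amounts to something like the following. The function names are made up for illustration; note that a real Content-Type header may carry parameters (e.g. "text/html; charset=iso-8859-1"), which a prefix test tolerates:

```c
#include <string.h>

/* Hypothetical sketch of the filtering in (3): anything text/* gets
 * indexed, but only text/html is scanned for links to spider. */
static int should_index(const char *content_type)
{
    return strncmp(content_type, "text/", 5) == 0;
}

static int should_spider(const char *content_type)
{
    return strncmp(content_type, "text/html", 9) == 0;
}
```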
Received on Wed Aug 14 22:48:40 2002