
Security of swishspider

From: Bill Moseley <moseley(at)>
Date: Wed Aug 14 2002 - 22:45:01 GMT
Ok, so I'm not a fan of the -S http method in swish.  For one thing I don't
like the idea of running a separate perl program for every URL fetched.  It
would be better to use the libwww C library, or use -S prog where the perl
program is only called once.  Someday it might just vanish.

One thing that has always concerned me with -S http is that swish uses a
system() call to run perl (and the swishspider script).  In that system
call swish passes the URL to fetch, and that URL can come from fetched web
pages.  That means *user data* is being passed through the shell, which is
a big security risk.
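To make the risk concrete, here's roughly the shape of a system()-style
invocation (the function and buffer are mine, for illustration; this is not
the actual swish code):

```c
#include <stdio.h>

/* Illustrative only: the general shape of a system()-style invocation.
 * The URL is single-quoted, but a single quote *inside* the URL breaks
 * out of the quoting, and whatever follows is interpreted by /bin/sh. */
void build_command(char *buf, size_t n, const char *tmpdir, const char *url)
{
    snprintf(buf, n, "perl ./swishspider '%s' '%s'", tmpdir, url);
    /* swish would then hand buf to system(), i.e. to the shell */
}
```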

Swish tries to protect things by single-quoting (double-quoting for
Windows) the URL passed to swishspider, and swishspider escapes any single
quote found in a URL:

        $link =~ s/'/%27/g;

That breaks the rule that you should allow known-good characters instead
of trying to block known-bad ones.
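Following that rule would mean validating the URL against an allow-list.
A minimal sketch in C (the function name and the exact character set are
mine, not swish's; the set is illustrative, not authoritative):

```c
#include <string.h>

/* Hypothetical allow-list check: accept a URL only if every character
 * is in a known-good set.  The set below is illustrative; tighten or
 * loosen it to taste.  Shell metacharacters (quotes, backticks,
 * semicolons, spaces) are deliberately absent. */
int url_is_safe(const char *url)
{
    static const char ok[] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        "-._~:/?#@&=+%";
    return url[0] != '\0' && strspn(url, ok) == strlen(url);
}
```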

I can't think of an easy exploit off the top of my head, but I wouldn't be
surprised if someone could find one.  That means that if you spider pages
that are not under your control, you might be giving the pages' authors
shell access to your machine.

Also, note that when running under Windows the double-quote is used for
quoting, so escaping the single quote is useless there.  Windows security
(or the lack of it) is another issue.

So, although this breaks my own rule not to mess with the -S http code, I
made some changes.  I'll need to know if there are any platforms (excluding
Windows) where this might break things.

1) when running under *Windows*, swish now calls system with:

       perl ./swishspider *tmpdir* *url*

   the change is that it used to call perl.exe.  That's probably still
   not secure, since *url* is being passed through the shell.

2) when not running under Windows, it does a fork/exec/wait and just calls
   the swishspider program directly.  This means the shebang line is used
   to determine the location of the perl program.
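The fork/exec/wait approach looks roughly like this (a sketch with
illustrative argument layout, not the actual swish code):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run swishspider directly with fork/exec/wait, so no shell ever sees
 * the URL; it's passed as a plain argv element.  The kernel honors
 * swishspider's #! line, so we never need to locate perl ourselves. */
int run_spider(const char *spider, const char *tmpdir, const char *url)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;                      /* fork failed */
    if (pid == 0) {
        execl(spider, spider, tmpdir, url, (char *)NULL);
        _exit(127);                     /* exec failed */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because exec() never consults a shell, a URL full of quotes, semicolons,
or backticks is just an inert string in argv.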

3) for some reason it seems like -S http was indexing every file fetched,
   regardless of content-type.  Now, only text/* types are indexed, and
   only text/html is spidered for links.

   I looked back through CVS and could see how
   at one time pages that were not text/* were set as NoContents.
   NoContents doesn't index the contents, but indexes just the
   path name as the content.  That seems less than useful.  

   Anyway, if you want more control over this stuff you should
   be using -S prog (you should be using it anyway).
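The new filtering rule boils down to prefix/equality checks on the
Content-Type; a minimal sketch (function names are mine, not swish's):

```c
#include <string.h>

/* Sketch of the new rule: index only text/* documents, and follow
 * links only in text/html.  Assumes the media type has any
 * ";charset=..." parameter already stripped. */
int should_index(const char *content_type)
{
    return strncmp(content_type, "text/", 5) == 0;
}

int should_spider(const char *content_type)
{
    return strcmp(content_type, "text/html") == 0;
}
```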

4) swishspider no longer does $link =~ s/'/%27/g; in an attempt to
   patch a security hole.  It's not needed under unix, since exec() is
   used to run the program (skipping the shell entirely).

5) since I was in there, -S http now stores the last-modified dates
   fetched from docs.  This is too bad, as that was a convincing
   argument for using -S prog instead of -S http.

(That argument still holds, though: -S prog doesn't fork for every URL,
makes it easy to adjust which content types are indexed, can use
keep-alive to reduce load on the server and fetch docs faster, doesn't
use system() when running under Windows, doesn't fetch entire docs of
the wrong content-type, and probably has much better code for dealing
with HTTP in general.)

Bill Moseley
Received on Wed Aug 14 22:48:40 2002