Re: Incorrect behavior for swishspider script

From: Ron Samuel Klatchko <rsk(at)not-real.corpmail.brightmail.com>
Date: Sat Apr 08 2000 - 04:42:18 GMT
On Fri, 7 Apr 2000, Andrew Ho wrote:
> The swishspider script that comes with SWISH-E has a slight error, it will
> not look for or report links in any document that has a content-type that
> is not exactly "text/html". Unfortunately, this means that a page with
> this perfectly valid HTTP 1.1 header:

That's a known bug.  There are patches for it at:
  http://sunsite.berkeley.edu/SWISH-E/Patches/spider
  http://sunsite.berkeley.edu/SWISH-E/Patches/spider2

> On another note, perhaps there should be a configuration option to set the
> full path AND FILENAME of the spidering program, such that the spidering
> program does not need to be explicitly called "swishspider" (if, for
> example, I wanted to write an intelligent spider of my own that knows the
> structure of my site).

Although that is a good idea, you can still do what you want by naming
your intelligent spider swishspider.  Not perfect, but it gets the job
done.

> Or at the very least some documentation about the interaction between the
> spider program and the SWISH-E indexing program.

swishspider always generates a .response file.  It has one or two lines.
The first is the HTTP status code.  The second is the content type if the
status code is 200, the new URL if the status is a redirect (30x) or
nothing if the status code is something else.
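
For anyone reading a .response file from their own tooling, the rules above could be sketched roughly like this in Python. The names SpiderResponse and parse_response are made up for illustration and are not part of SWISH-E; the sketch assumes the status code is a plain integer on the first line.

```python
from typing import NamedTuple, Optional


class SpiderResponse(NamedTuple):
    """Hypothetical container for the fields of a .response file."""
    status: int
    content_type: Optional[str]   # set only when status is 200
    redirect_url: Optional[str]   # set only when status is 30x


def parse_response(text: str) -> SpiderResponse:
    """Parse the one- or two-line .response format described above."""
    lines = text.splitlines()
    if not lines:
        raise ValueError("empty .response file")

    # Line 1: the HTTP status code.
    status = int(lines[0].strip())

    # Line 2 (optional): meaning depends on the status code.
    second = lines[1].strip() if len(lines) > 1 else ""
    content_type = second if status == 200 and second else None
    redirect_url = second if 300 <= status < 400 and second else None

    return SpiderResponse(status, content_type, redirect_url)
```

For example, parse_response("302\nhttp://example.com/new\n") yields a status of 302 with redirect_url set and content_type left as None.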

If the status code is 200, swishspider also generates a .contents file
with the contents of the URL.

Finally, if the status code is 200 and the content type is text/html,
swishspider generates a .links file with all the hrefs from <A> tags.
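
If you do write your own spider, the output side of the description above might look something like the following Python sketch. Several things here are assumptions rather than documented behavior: the helper names (AnchorHrefs, write_spider_output), the idea that the three files share a base path with the suffix appended, and the one-link-per-line layout of the .links file. The sketch also compares only the media type before any ";" parameters, so a header such as "text/html; charset=iso-8859-1" still produces a .links file. That is my guess at a sensible fix for the problem raised in this thread, not a description of what the linked patches do.

```python
from html.parser import HTMLParser
from pathlib import Path


class AnchorHrefs(HTMLParser):
    """Collect href values from <a> tags (tag names are case-insensitive)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def write_spider_output(base, status, content_type=None, location=None, body=b""):
    """Write .response, and where applicable .contents and .links.

    `base` is a hypothetical base path; the suffix scheme is an assumption.
    """
    base = str(base)

    # Second line: content type for 200, new URL for 30x, otherwise nothing.
    if status == 200:
        second = content_type or ""
    elif 300 <= status < 400:
        second = location or ""
    else:
        second = ""
    lines = [str(status)] + ([second] if second else [])
    Path(base + ".response").write_text("\n".join(lines) + "\n")

    if status != 200:
        return

    # .contents holds the body of the URL as fetched.
    Path(base + ".contents").write_bytes(body)

    # Assumed fix: ignore parameters such as "; charset=..." when comparing.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type == "text/html":
        parser = AnchorHrefs()
        parser.feed(body.decode("utf-8", errors="replace"))
        Path(base + ".links").write_text("".join(h + "\n" for h in parser.hrefs))
```

Fetching the URL is left out on purpose; the caller would pass in whatever status, headers, and body its own HTTP client returned.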

moo
------------------------------------------------------------
        Ron Samuel Klatchko - Senior Software Jester
            Brightmail Inc - rsk@brightmail.com
Received on Sat Apr 8 00:43:52 2000