Re: Spider Design Flaw! HTTP Codes

From: Chris Humphries <ChrisJMH(at)not-real.vermilion99.freeserve.co.uk>
Date: Mon Feb 28 2000 - 12:18:59 GMT
Public Const HTTP_STATUS_OK As Integer =                  200
Public Const HTTP_STATUS_CREATED As Integer =             201
Public Const HTTP_STATUS_ACCEPTED As Integer =            202
Public Const HTTP_STATUS_PARTIAL As Integer =             203
Public Const HTTP_STATUS_NO_CONTENT As Integer =          204
Public Const HTTP_STATUS_AMBIGUOUS As Integer =           300
Public Const HTTP_STATUS_MOVED As Integer =               301
Public Const HTTP_STATUS_REDIRECT As Integer =            302
Public Const HTTP_STATUS_REDIRECT_METHOD As Integer =     303
Public Const HTTP_STATUS_NOT_MODIFIED As Integer =        304
Public Const HTTP_STATUS_BAD_REQUEST As Integer =         400
Public Const HTTP_STATUS_DENIED As Integer =              401
Public Const HTTP_STATUS_PAYMENT_REQ As Integer =         402
Public Const HTTP_STATUS_FORBIDDEN As Integer =           403
Public Const HTTP_STATUS_NOT_FOUND As Integer =           404
Public Const HTTP_STATUS_BAD_METHOD As Integer =          405
Public Const HTTP_STATUS_NONE_ACCEPTABLE As Integer =     406
Public Const HTTP_STATUS_PROXY_AUTH_REQ As Integer =      407
Public Const HTTP_STATUS_REQUEST_TIMEOUT As Integer =     408
Public Const HTTP_STATUS_CONFLICT As Integer =            409
Public Const HTTP_STATUS_GONE As Integer =                410
Public Const HTTP_STATUS_AUTH_REFUSED As Integer =        411
Public Const HTTP_STATUS_SERVER_ERROR As Integer =        500
Public Const HTTP_STATUS_NOT_SUPPORTED As Integer =       501
Public Const HTTP_STATUS_BAD_GATEWAY As Integer =         502
Public Const HTTP_STATUS_SERVICE_UNAVAIL As Integer =     503
Public Const HTTP_STATUS_GATEWAY_TIMEOUT As Integer =     504
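
For anyone who wants to double check status codes without a header file to hand, here is a minimal sketch using Python's standard `http.HTTPStatus` enum. The helper names `is_redirect` and `describe` are hypothetical and exist only for this illustration; the standard names also differ from the WinInet-style names in the list.

```python
from http import HTTPStatus


def is_redirect(code: int) -> bool:
    """Return True for the redirect codes a spider would normally follow."""
    return code in (
        HTTPStatus.MOVED_PERMANENTLY,  # 301
        HTTPStatus.FOUND,              # 302
        HTTPStatus.SEE_OTHER,          # 303
    )


def describe(code: int) -> str:
    """Return the numeric code followed by its standard reason phrase."""
    return f"{code} {HTTPStatus(code).phrase}"


if __name__ == "__main__":
    for code in (301, 302, 304):
        print(describe(code), "-> follow" if is_redirect(code) else "-> do not follow")
```

Note that 304 (Not Modified) is deliberately not treated as a redirect here, since it carries no new location.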

Chris Humphries

-----Original Message-----
From:	Ron Samuel Klatchko [SMTP:rsk@corpmail.brightmail.com]
Sent:	Sunday, February 27, 2000 1:54 AM
To:	Multiple recipients of list
Subject:	[SWISH-E] Re: Spider Design Flaw!


A couple of solutions to this problem:

1) Fix it via your web server (most web servers have a way of specifying
the name of the default page).  This solves the problem without any
software development and makes better use of your server and bandwidth
resources.

2) Have the spider handle the tag.  Before writing the response file, see
if this META tag is specified and if so, translate it into an HTTP
redirect.  Write the response with an HTTP permanent redirect (that is a
301; 304 means Not Modified) and then write the new URL on the next line.
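
To make option 2 a little more concrete, here is a minimal sketch of the detection step only, written in Python rather than in the spider's own Perl. The class `MetaRefreshParser` and the function `refresh_target` are hypothetical names, not part of SWISH-E, and how the result is written into the response file is left to the spider.

```python
import re
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class MetaRefreshParser(HTMLParser):
    """Record the URL of the first <META HTTP-EQUIV="Refresh"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.target: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta" or self.target is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # CONTENT looks like "1; URL=http://host/page.html"
        match = re.search(r"url\s*=\s*['\"]?([^'\";]+)",
                          attr.get("content", ""), re.IGNORECASE)
        if match:
            self.target = match.group(1).strip()


def refresh_target(html: str, base_url: str) -> Optional[str]:
    """Return the absolute refresh URL found in html, or None if there is none."""
    parser = MetaRefreshParser()
    parser.feed(html)
    if parser.target is None:
        return None
    # Resolve relative targets against the URL of the page being indexed.
    return urljoin(base_url, parser.target)
```

A spider could call `refresh_target(page_text, page_url)` before writing the response and, when it returns a URL, emit a 301 pointing there instead of indexing the empty page.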

moo

On Sat, 26 Feb 2000, PropheZine Owner wrote:
> As you can tell by the number of posts I have sent in, I am experimenting
> with Spidering and also AutoSwish.  Thank you all for your help.
> 
> Here is a design flaw.  I'm not knocking anyone as I think the software is
> wonderful.  I wish I knew C and Perl better so I could offer modifications.
> 
> I have a website that is 4+ years old.  Back then we created a directory
> (actually we have this problem in many directories) and instead of an
> index.html we had a file named archives.html.  We added SSI at some point
> and, since the search engines had archives.html indexed, we created
> archives.shtml and turned archives.html into a redirect page.
> 
> Later we created an index.html and inserted this code:
> 
>   <META HTTP-EQUIV="Refresh" CONTENT="1;
> URL=http://www.prophezine.com/search/database/archives.html">
> 
> Turns out that when I insert http://www.prophezine.com/search/database/ in
> the config file it only indexes the index.html page that is returned.  That
> page has some meta tags but no body.
> 
> What is needed is a change to the spider to follow the refresh tag.  I am
> not sure of all the tags possible so there may be another to follow but this
> should definitely be followed.
> 
> Thoughts?
> 
> Bob
> 
> 
Received on Mon Feb 28 07:22:45 2000