Re: spider bug?

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Oct 05 2004 - 17:57:08 GMT
On Tue, Oct 05, 2004 at 09:19:36AM -0700, Mark Morgan wrote:
> Warning: Unknown header line: 'tml>Path-Name:
> http://www.e-caps.com/za/ECP?PAGE=FDA_DISCLAIMER&OMI=10093,10072&AMI=10093'
> from program spider.pl
> err: External program failed to return required headers Path-Name:

Short answer is I don't know what the problem is.

Here's the longer answer:

The spider outputs records one right after another without any
end-of-file marker.  Instead it generates a content-length header
saying how many bytes long the content is.  The header looks like:

    Path-Name: http://www.e-caps.com/za/ECP?PAGE=FDA_DISCLAIMER&OMI=10093,10072&AMI=10093
    Content-Length: 11879
    Document-Type: html*

So swish reads that header and then knows to read in 11879 bytes after
the header.  After it does that it starts over and expects to find a
header.  But instead it found:

   tml>Path-Name:

So swish-e read in four too few bytes.  That is, the spider said to
read 11879 bytes but the content was really 11883 bytes long.  Or,
another possibility is you are running on Windows and swish-e asked
the OS for 11879 bytes but line-ending conversion (or something like
it) changed how many bytes actually came back.
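
To make the failure mode concrete, here is a minimal sketch that fakes two
records and reads the first one back by its declared length.  It assumes the
headers are separated from the content by a blank line; the helper names
(make_record, leftover_after_first_record) and the localhost URLs are made up
for illustration and are not part of spider.pl or swish-e.

```perl
use strict;
use warnings;

# Build one record: header lines, a blank line, then the content.
# $declared lets us lie about the length to simulate the bug.
sub make_record {
    my ($path, $content, $declared) = @_;
    $declared = length $content unless defined $declared;
    return "Path-Name: $path\nContent-Length: $declared\n\n$content";
}

# Consume the first record the way a length-driven reader would and
# return whatever is left in the stream afterwards.
sub leftover_after_first_record {
    my ($stream) = @_;
    my ($headers, $rest) = split /\n\n/, $stream, 2;
    my ($len) = $headers =~ /^Content-Length:\s*(\d+)/mi;
    return substr($rest, $len);
}

my $html   = "<html><body>hi</body></html>";
# First record under-reports its length by four bytes.
my $stream = make_record('http://localhost/a', $html, length($html) - 4)
           . make_record('http://localhost/b', $html);

my $left = leftover_after_first_record($stream);
print substr($left, 0, 14), "\n";    # prints: tml>Path-Name:
```

The four bytes that were not read ("tml>") end up glued to the front of the
next header, which is the shape of the warning quoted above.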

Now, spider.pl should be reporting the correct number of bytes:

    # ugly and maybe expensive, but perhaps more portable than "use bytes"
    my $bytecount = length pack 'C0a*', $$content;

But, you could try replacing that with:

    use bytes;
    my $bytecount = length $$content;

and see if that fixes it.
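
The reason the byte count gets special handling at all is that Perl's length
counts characters, not bytes, when a string holds decoded text.  A small
sketch of the difference, using the core Encode module rather than either
snippet above; the sample string and the byte_length helper are made up for
illustration:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Number of bytes the string occupies once written out as UTF-8.
sub byte_length {
    my ($text) = @_;
    return length encode_utf8($text);
}

# Hypothetical sample: 7 characters, two of which need more than one
# byte in UTF-8 (U+00EF takes 2 bytes, U+263A takes 3).
my $text = "na\x{ef}ve \x{263A}";

my $chars = length $text;
my $bytes = byte_length($text);

print "characters: $chars, bytes: $bytes\n";    # characters: 7, bytes: 10
```

If a Content-Length header carried the character count for text like this, it
would be three short of what actually lands in the output.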

Now, what I'd do (and what I did) was set up the spider to fetch just
the above document, plus one other.  Then I indexed that to see if the
problem showed up.  It didn't.  So that either means I didn't grab
the correct documents, or the problem doesn't show up on my machine.

But you can try it and then try indexing the resulting output and see
if the problem shows up.  Then you can do things like look at the
actual content and figure out if it's the content-length header that's
wrong or if it's how swish-e is reading the file back in.

I will note that in the past what has happened is we had multi-byte
characters that make the content-length header too short -- but then
swish-e reads too much so you could get an error like:

    Warning: Unknown header line: 'th-Name:

because swish-e read too many bytes.

Is all that clear?

Anyway, here's a little test config:

    my @urls = (
        'http://www.e-caps.com/za/ECP?PAGE=FDA_DISCLAIMER&OMI=10093,10072&AMI=10093',
        'http://www.e-caps.com/za/ECP?PAGE=PRIVACY&OMI=10092,10072&AMI=10092',
    );

    @servers = ( {
        email => 'your@email.here',
        base_url => \@urls,
        test_url => sub { return grep { $_[0] eq $_ } @urls },
    } );
    1;

then run:

   spider.pl test.config > output
   swish-e -S prog -i stdin < output

and see where it breaks.
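
If it does break, a small checker run over the saved output can show which
record is off without involving swish-e at all.  This is only a sketch, not
something that ships with swish-e: check_stream and the script name
check_records.pl are made up, and it assumes each record is header lines, a
blank line, then exactly Content-Length bytes of content.

```perl
use strict;
use warnings;

# Walk a stream of records.  Returns a description of the first problem
# found, or nothing if every record is consistent.
sub check_stream {
    my ($fh) = @_;
    binmode $fh;                      # raw bytes, no line-ending translation
    my $record = 0;
    until (eof $fh) {
        $record++;
        my $start = tell $fh;         # byte offset where this record begins
        my %header;
        while (defined(my $line = <$fh>)) {
            $line =~ s/\r?\n\z//;
            last if $line eq '';      # blank line ends the header block
            return "record $record (byte $start): bad header line '$line'"
                unless $line =~ /^([\w-]+):\s*(.*)$/;
            $header{lc $1} = $2;
        }
        for my $need ('path-name', 'content-length') {
            return "record $record (byte $start): missing $need header"
                unless defined $header{$need};
        }
        my $want = $header{'content-length'};
        return "record $record (byte $start): non-numeric length '$want'"
            unless $want =~ /^\d+$/;
        # Read exactly as many bytes as the header promised.
        my $got = read($fh, my $content, $want) || 0;
        return "record $record ($header{'path-name'}): wanted $want bytes, got $got"
            if $got != $want;
    }
    return;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my $problem = check_stream($fh);
    print defined $problem ? "$problem\n" : "all records consistent\n";
}
```

Run it as "perl check_records.pl output".  A complaint about the record
after the FDA_DISCLAIMER page would point at that page's Content-Length; a
clean run would suggest the problem is on the reading side.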

A few other comments: your page says it's 8859-1 and I don't see any
odd entities so I doubt it's a multi-byte char problem.

Your page does not end in a newline.  That shouldn't be a problem, but
adding one is something you might try.

Might try running your pages through http://validator.w3.org/ --
again swish and the spider shouldn't care about broken HTML, but you
might spot something.

Finally, when you do find the problem can you post back for the archive?

-- 
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs
   swish-e@sunsite.berkeley.edu
Received on Tue Oct 5 10:57:23 2004