On Wed, Sep 03, 2003 at 01:02:53PM -0700, Thomas Dowling wrote:
> I am trying to use SWISH-E (I've tried both 2.2.3 and 2.4.0 pr1) to
> spider our website. Following directions in the documentation, I set up
> a basic swish.conf and spider.conf, and my indexing run always bombs
> with the message:
> err: External program failed to return required headers Path-Name: &
> I found what appeared to be an identical problem report in the list
> archives from last April (<http://swish-e.org/archive/5149.html>), but
> didn't see a definitive solution posted there. None of the suggestions
> offered there affect the problem here.
That error message typically means the Content-Length was set wrong on
the *previous* document, so when swish-e tries to read the next
document's headers it is reading from the wrong place in the stream.
> I took the liberty of inserting a line into spider.pl to print out the
> headers, and every document it reports on does have Path-Name and
> Content-Length headers, which makes me suspect the problem is either
> with swish-e itself or in the interaction between spider.pl and swish-e.
I often do things the hard way. For example, I've saved the output from
spider.pl to a file, then extracted each document one by one and
verified that its Content-Length is indeed its byte length.
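
That check can be automated. What follows is only a rough sketch, written
in Python rather than Perl so it stands apart from spider.pl, and it is
not swish-e's actual parser. It assumes the saved output is a sequence of
records, each a header block containing Path-Name and Content-Length, a
blank line, and then exactly Content-Length bytes of content. The names
read_documents and make_record are made up for this illustration.

```python
import io

REQUIRED = (b"path-name", b"content-length")


def read_documents(stream):
    """Read records laid out as: header lines, blank line, body bytes.

    Returns a list of (path, body) tuples. Raises ValueError as soon as a
    header block lacks Path-Name or Content-Length.
    """
    docs = []
    while True:
        line = stream.readline()
        if not line:
            return docs  # clean end of stream
        headers = {}
        while line.strip():  # header block ends at the first blank line
            name, _, value = line.partition(b":")
            headers[name.strip().lower()] = value.strip()
            line = stream.readline()
        missing = [h.decode() for h in REQUIRED if h not in headers]
        if missing:
            raise ValueError(
                f"document {len(docs) + 1}: missing {', '.join(missing)}"
            )
        length = int(headers[b"content-length"])
        body = stream.read(length)  # exactly as many bytes as declared
        docs.append((headers[b"path-name"].decode(), body))


def make_record(path, text, count_bytes):
    """Build one record; count_bytes=False mimics a character-based length."""
    body = text.encode("utf-8")
    length = len(body) if count_bytes else len(text)
    return f"Path-Name: {path}\nContent-Length: {length}\n\n".encode() + body


# Same two documents, once with byte lengths and once with the first
# document's length counted in characters (one short, because of the e-acute).
good = make_record("a.html", "caf\u00e9\n", True) + make_record("b.html", "ok\n", True)
bad = make_record("a.html", "caf\u00e9\n", False) + make_record("b.html", "ok\n", True)

print(len(read_documents(io.BytesIO(good))))
try:
    read_documents(io.BytesIO(bad))
except ValueError as err:
    print("error:", err)
```

Note that the failure is reported against the second document even though
the bad length belongs to the first: the reader stops one byte early, sees
the leftover newline as an empty header block, and finds no Path-Name.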
The problem (possibly depending on the version of Perl and the LANG
setting) is that spider.pl uses length() to set the Content-Length
header, but for multi-byte chars (which swish-e won't support)
length() and the size of the data in bytes can be two different things.
So I have also edited spider.pl so that, where it grabs the length(), it
writes the content out to a file on disk and then stats the file to
check whether the length is the same as the file size.
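
To illustrate the character-count versus byte-count difference outside of
Perl, here is a small Python analogue of that write-to-disk-and-stat
check. It is only a sketch: the helper name lengths is made up, and UTF-8
is assumed as the encoding.

```python
import os
import tempfile


def lengths(text, encoding="utf-8"):
    """Return (character count, encoded byte count, size on disk)."""
    data = text.encode(encoding)
    # Write the encoded bytes to a scratch file so the size can be stat'ed.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        handle.write(data)
        path = handle.name
    try:
        on_disk = os.stat(path).st_size
    finally:
        os.remove(path)
    return len(text), len(data), on_disk


print(lengths("plain ascii"))  # all three numbers agree
print(lengths(chr(400)))       # one character, but two bytes in UTF-8
```

For pure ASCII content the three numbers match, which would explain why a
site indexes fine until the first page with a multi-byte character.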
> I've tried this against multiple web sites. The number of files scanned
> before the indexing run dies varies from site to site, but is consistent
> on each site. FWIW, I'm running swish-e under RedHat 8.0 with Perl
> 5.8.0 (and, if I'm reading things correctly, LWP 5.65).
I think it was RedHat 9 where the default LANG is UTF-8. There have
been problems reported in this case. I'm not sure if it applies to RH
Assuming that this is a multi-byte character problem:
There is some code in spider.pl's output_content() function
that was supposed to fix this:
# ugly and maybe expensive, but perhaps more portable than "use bytes"
my $bytecount = length pack 'C0a*', $$content;
$ perl -le '$x=chr(400); print length pack "C0a*", $x'
Here is the same check without and with the "use bytes;" pragma:
$ perl -le '$x=chr(400); print length $x'
$ perl -le '$x=chr(400); use bytes; print length $x'
Received on Wed Sep 3 21:14:57 2003