On Wed, Jun 23, 2004 at 08:57:54AM -0700, John Kelley wrote:
> I installed both swish-e2.2.3 and 2.4.2 and get the following (this is from
> 2.4.2):
> Indexing Data Source: "External-Program"
> Indexing "./spider.pl"
> External Program found: ./spider.pl
> http://www.myjewishlearning.com/spider.html - Using HTML2 parser - (11565
> words)
>
> Warning: Unknown header line: 'elim/TraditionalFarewell.htm'>Traditional
> Farewells</a><br>' from program ./spider.pl
Can't tell you for sure. My guess is some type of encoding problem --
perhaps multi-byte characters.
It's somewhat easy to figure out, if you have a good editor.
Run
./spider.pl > out.file
then look at out.file. The format looks a lot like an HTTP or email
message: a few header lines, then a blank line (two newlines in a row),
then the content. One of the header lines is the content-length, and
what's likely happening is that its value doesn't match the actual
length of the content for some reason.
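If out.file is too big to eyeball, a small script can walk it the same way the indexer would: read the headers, skip exactly content-length bytes, and check that what comes next looks like a header again. This is only a rough sketch in Python. The set of header names and the function name first_bad_header are assumptions made for illustration, not anything taken from swish-e itself, so adjust them to match what you actually see in out.file.

```python
import re
import sys

# ASSUMPTION: header names the indexer accepts. Adjust this set to match
# what you actually see at the top of each record in out.file.
KNOWN_HEADERS = {b"path-name", b"content-length", b"last-mtime", b"document-type"}
HEADER_RE = re.compile(rb"^([A-Za-z][A-Za-z0-9-]*):[ \t]*(.*)$")


def first_bad_header(data):
    """Return (last_path, byte_offset, line) for the first line that should be
    a header but is not, or None if every record lines up."""
    pos = 0
    path = None
    while pos < len(data):
        length = None
        while True:  # read one header block
            end = data.find(b"\n", pos)
            if end == -1:
                end = len(data)
            line = data[pos:end].rstrip(b"\r")
            if line == b"":  # a blank line ends the headers
                pos = min(end + 1, len(data))
                break
            m = HEADER_RE.match(line)
            if m is None or m.group(1).lower() not in KNOWN_HEADERS:
                return path, pos, line
            name, value = m.group(1).lower(), m.group(2)
            if name == b"path-name":
                path = value.decode("latin-1")
            elif name == b"content-length":
                length = int(value)
            pos = min(end + 1, len(data))
        if length is None:
            return path, pos, b"(no Content-Length header)"
        pos += length  # skip exactly the declared number of bytes
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as fh:
        print(first_bad_header(fh.read()))
```

Run it with out.file as the only argument. If it prints something other than None, the path it reports is the last Path-Name seen before things went off the rails, which is probably the document whose content-length is wrong.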
In the past, what happened was that spider.pl reported the
content-length as the number of *characters*, but swish-e expects the
number of *bytes* -- which will be different if you have any
multi-byte characters.
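To make the characters-versus-bytes difference concrete, here is a tiny Python illustration (not spider.pl's actual code; the helper name char_and_byte_lengths is made up for this example, and UTF-8 is assumed as the encoding):

```python
def char_and_byte_lengths(text, encoding="utf-8"):
    """Return (number of characters, number of bytes once encoded)."""
    return len(text), len(text.encode(encoding))


# Pure ASCII: the two counts agree.
print(char_and_byte_lengths("Farewell"))   # (8, 8)

# One accented letter: UTF-8 needs two bytes for it, so the counts differ.
print(char_and_byte_lengths("caf\u00e9"))  # (4, 5)
```

A content-length of 4 for the second string would leave one byte of content unread, and that leftover would then be taken for the start of the next header line.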
Spider.pl was patched to report *bytes*, so it's been a while since
anyone has reported such a problem. So, I could be wrong and your
problem might be something else.
You might also experiment with your LANG environment variable. A Google
search will turn up lots of reports from Red Hat users about UTF-8
encoding problems. I have LANG=en_US on my machine.
Anyway, use of a good editor will allow you to see what's happening.
--
Bill Moseley
moseley@hank.org
Received on Wed Jun 23 16:30:42 2004