Re: Installed on Red Hat Linux Enterprise AS 3 and get spider.pl

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Jun 23 2004 - 16:30:40 GMT

On Wed, Jun 23, 2004 at 08:57:54AM -0700, John Kelley wrote:
> I installed both swish-e2.2.3 and 2.4.2 and get the following (this is from
> 2.4.2):
> Indexing Data Source: "External-Program"
> Indexing "./spider.pl"
> External Program found: ./spider.pl
> http://www.myjewishlearning.com/spider.html - Using HTML2 parser -  (11565
> words)
> 
> Warning: Unknown header line: 'elim/TraditionalFarewell.htm'>Traditional
> Farewells</a><br>' from program ./spider.pl

Can't tell you for sure.  My guess is some type of encoding problem --
perhaps multi-byte characters.

It's somewhat easy to figure out, if you have a good editor.

Run 

    ./spider.pl > out.file

then look at out.file.  The format looks a lot like an HTTP or email
message.  There are a few header lines, then two newlines, then the
content.  One of the header lines is the content-length.  What's
likely happening is that the content-length value does not match the
actual length of the content for some reason.
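If scanning out.file by eye is tedious, a small script can walk the records for you.  This is only a rough sketch, not part of swish-e or spider.pl: check_records is a made-up name, and the set of header names is my assumption about what the external-program format accepts.  It reads the header lines, skips exactly Content-Length bytes, and complains if what follows is not a header line.

```python
# Header names assumed to be valid in the external-program format.
# This set is an assumption; adjust it to match your swish-e version.
KNOWN_HEADERS = {b"path-name", b"content-length", b"last-mtime",
                 b"document-type", b"update-mode"}


def check_records(data):
    """Return a list of problem descriptions found in spider output.

    Assumes each record is: header lines, one empty line, then exactly
    Content-Length bytes of content, followed directly by the next record.
    Stops at the first problem, since everything after it is out of sync.
    """
    problems = []
    pos = 0
    while pos < len(data):
        length = None
        # Read header lines up to the empty line that ends the header block.
        while True:
            end = data.find(b"\n", pos)
            if end == -1:
                problems.append("offset %d: unterminated header line" % pos)
                return problems
            line = data[pos:end].rstrip(b"\r")
            if line == b"":
                pos = end + 1
                break
            name, _, value = line.partition(b":")
            if name.lower() not in KNOWN_HEADERS:
                problems.append("offset %d: not a known header line: %r"
                                % (pos, line[:60]))
                return problems
            if name.lower() == b"content-length":
                try:
                    length = int(value.strip())
                except ValueError:
                    problems.append("offset %d: bad Content-Length" % pos)
                    return problems
            pos = end + 1
        if length is None:
            problems.append("offset %d: no Content-Length header" % pos)
            return problems
        if pos + length > len(data):
            problems.append("offset %d: content shorter than Content-Length"
                            % pos)
            return problems
        pos += length  # skip the content, counted in bytes
    return problems
```

To try it, read the file in binary mode so nothing gets decoded: check_records(open("out.file", "rb").read()).  The first problem it reports should be close to the record whose length is wrong.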

In the past what has happened is that the content-length was reported as
the number of *characters* by spider.pl, but swish-e expects that to be
the number of *bytes* -- which will be different if you have any
multi-byte characters.
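To make the difference concrete, here is a tiny illustration in Python (not spider.pl's actual code; the sample string is made up).  A single accented letter is one character but two bytes in UTF-8, so the two counts disagree.

```python
# Character count vs. UTF-8 byte count for the same text.
text = "caf\u00e9"                  # "café": 4 characters, one non-ASCII
encoded = text.encode("utf-8")      # the accented letter becomes two bytes

char_count = len(text)              # what a character-based length reports
byte_count = len(encoded)           # what a byte-based reader expects
print(char_count, byte_count)       # prints: 4 5
```

With a length that is one too small, the last byte of the content is left over and gets read as the start of the next header line.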

Spider.pl was patched to report *bytes*, so it's been a while since
anyone has reported such a problem.  So, I could be wrong and your
problem might be something else.

You might play with your LANG environment variable.  Google will find
lots of reports about Red Hat users and UTF-8 encoding problems.  I have
LANG=en_US on my machine.

Anyway, use of a good editor will allow you to see what's happening.

-- 
Bill Moseley
moseley@hank.org

Received on Wed Jun 23 16:30:42 2004