Many thanks to Bill, who responded immediately. It was indeed the
double-byte issue: Red Hat 9 uses en_US.UTF-8 as the default LANG. I changed
that, and it resolved the issue.
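
For anyone who hits the same symptom, here is a minimal sketch of overriding
the locale for a single indexing run instead of changing it system-wide. The
thread does not record which value LANG was changed to, so the "C" locale and
the helper name run_in_c_locale are assumptions, not the exact fix applied.

```shell
# Hypothetical helper: run one command under a single-byte locale,
# leaving the login shell's LANG (en_US.UTF-8 by default on Red Hat 9) untouched.
# Assumption: "C" is used here; another non-UTF-8 locale may work as well.
# LC_ALL is set too, because it takes precedence over LANG when present.
run_in_c_locale() {
    env LANG=C LC_ALL=C "$@"
}

# Example, using the command from the original report:
#   run_in_c_locale swish-e -S prog -c e-caps.conf -v9
```

Because the variables are passed through env, they apply only to the wrapped
command and to the spider.pl process it starts.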
From: Bill Conlon [mailto:email@example.com]
Sent: Tuesday, October 05, 2004 9:26 AM
Subject: Re: [SWISH-E] spider bug?
ah, the dreaded double-byte character rears its ugly head.
on Redhat set environment:
On Tuesday, October 5, 2004, at 09:20 AM, Mark Morgan wrote:
> I'm trying to index a client site, www.e-caps.com. I'm using 2.5.2,
> have tried 2.4.2, with the same results. Some pages are OK, but one is
> confusing spider.pl. I get:
> Parsing config file 'e-caps.conf'
> Indexing Data Source: "External-Program"
> Indexing "spider.pl"
> External Program found: /usr/local/lib/swish-e/spider.pl
> /usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'
> http://www.e-caps.com/za/ECP?PAGE=ABOUT_US - Using HTML2 parser - (470
> http://www.e-caps.com/za/ECP?PAGE=HOME - Using HTML2 parser - (409
> http://www.e-caps.com/za/ECP?PAGE=PRODUCTS_MAIN - Using HTML2 parser -
> http://www.e-caps.com/za/ECP?PAGE=KNOWLEDGE - Using HTML2 parser -
> Warning: Unknown header line: 'tml>Path-Name:
> from program spider.pl
> err: External program failed to return required headers Path-Name:
> The knowledge page passes HTML validation as far as structure, yet for some
> reason, it's leaving the spider with the extraneous 'tml>' string.
> My config is:
> # Configuration file for spidering the e-caps site
> # Use the "spider.pl" program included with Swish-e
> IndexDir spider.pl
> # Define what site to index
> SwishProgParameters default
> and the command is:
> swish-e -S prog -c e-caps.conf -v9
> Other pages on the site, as you can see in the first few, go OK, but for
> some reason, the knowledge page makes it blow chunks. Anyone have any
> ideas? If I run with -S http, it goes OK, but I need to use prog, as we
> have a bunch of PDF files that we want to index.
> | Mark Morgan
> | Senior Programmer/Analyst
> | T H E Z A N E R A Y G R O U P , I N C .
> | firstname.lastname@example.org
> | 25 O'Brien Avenue
> | Whitefish, MT 59937
> | 406.863.8000
> | http://www.zaneray.com
Received on Tue Oct 5 11:03:07 2004