Many thanks to Bill, who responded immediately. It was indeed the
double-byte issue: Red Hat 9 uses en_US.UTF-8 as the default LANG. I changed
that, and it resolved the problem.
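For anyone hitting the same 'tml>' leftover: the thread does not spell out the mechanism, so treat what follows as an assumption. One plausible reading is that the length reported for a page is counted in characters while the page is sent as UTF-8 bytes, so the reader stops short and the tail of the document runs into the next header. A rough shell sketch of that mismatch, using a made-up sample string (`sample` and the other variable names are only illustrative):

```shell
# Sample text "cafe" with an accented e: the accented letter is written as its
# two UTF-8 bytes (octal 303 251). The string is made up for illustration.
sample=$(printf 'caf\303\251')

# Number of bytes actually sent over a pipe.
byte_count=$(printf '%s' "$sample" | wc -c | tr -d ' ')

# Number of UTF-8 characters: drop continuation bytes (octal 200-277)
# and count what is left.
char_count=$(printf '%s' "$sample" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' ')

# If a length header carried char_count, this many bytes would be left unread.
leftover=$((byte_count - char_count))
echo "bytes=$byte_count chars=$char_count leftover=$leftover"
```

By the same arithmetic, four two-byte characters in a page would leave four bytes unread, which is the length of 'tml>'.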
-----Original Message-----
From: Bill Conlon [mailto:bill@tothept.com]
Sent: Tuesday, October 05, 2004 9:26 AM
To: mark@zaneray.com
Subject: Re: [SWISH-E] spider bug?
Ah, the dreaded double-byte character rears its ugly head.
On Red Hat, set the environment variable:
LANG=en_US
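A minimal sketch of applying that setting to a single run rather than changing the system default. The wrapper name `run_with_lang` is hypothetical, the locale is written in its conventional spelling `en_US`, and the swish-e command line is the one from the original post. If LC_ALL or LC_CTYPE is set in the environment, it takes precedence over LANG, so this alone may not be enough on every system.

```shell
# Hypothetical helper: run one command with LANG overridden,
# leaving the calling shell's own LANG as it was.
run_with_lang() {
    LANG=en_US "$@"
}

# Usage with the command from the original post (commented out so the
# snippet runs even where swish-e is not installed):
# run_with_lang swish-e -S prog -c e-caps.conf -v9

# Quick check that the child process sees the overridden value.
seen_lang=$(run_with_lang sh -c 'printf %s "$LANG"')
echo "child sees LANG=$seen_lang"
```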
On Tuesday, October 5, 2004, at 09:20 AM, Mark Morgan wrote:
> I'm trying to index a client site, www.e-caps.com. I'm using 2.5.2, and
> have tried 2.4.2, with the same results. Some pages are OK, but one is
> confusing spider.pl. I get:
>
> Parsing config file 'e-caps.conf'
> Indexing Data Source: "External-Program"
> Indexing "spider.pl"
> External Program found: /usr/local/lib/swish-e/spider.pl
> /usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'
> http://www.e-caps.com/za/ECP?PAGE=ABOUT_US - Using HTML2 parser - (470 words)
> http://www.e-caps.com/za/ECP?PAGE=HOME - Using HTML2 parser - (409 words)
> http://www.e-caps.com/za/ECP?PAGE=PRODUCTS_MAIN - Using HTML2 parser - (140 words)
> http://www.e-caps.com/za/ECP?PAGE=KNOWLEDGE - Using HTML2 parser - (387 words)
>
> Warning: Unknown header line: 'tml>Path-Name: http://www.e-caps.com/za/ECP?PAGE=FDA_DISCLAIMER&OMI=10093,10072&AMI=10093' from program spider.pl
> err: External program failed to return required headers Path-Name:
>
>
> The knowledge page passes html validation as far as structure, yet for some
> reason, it's leaving the spider with the extraneous 'tml>' string.
>
> My config is:
>
> # Configuration file for spidering the e-caps site
> # Use the "spider.pl" program included with Swish-e
> IndexDir spider.pl
>
> # Define what site to index
> SwishProgParameters default http://www.e-caps.com/za/ECP?PAGE=ABOUT_US
>
> and the command is:
>
> swish-e -S prog -c e-caps.conf -v9
>
>
>
> Other pages on the site, as you can see in the first few, go OK, but for
> some reason, the knowledge page makes it blow chunks. Anyone have any
> ideas? If I run with -S http, it goes OK, but I need to use prog, as we
> have a bunch of PDF files that we want to index.
>
>
> |
> | Mark Morgan
> | Senior Programmer/Analyst
> | T H E Z A N E R A Y G R O U P , I N C .
> |
> | mark@zaneray.com
> |
> | 25 O'Brien Avenue
> | Whitefish, MT 59937
> | 406.863.8000
> |
> | http://www.zaneray.com
> |
Received on Tue Oct 5 11:03:07 2004