FW: spider bug?

From: Mark Morgan <mark(at)not-real.zaneray.com>
Date: Tue Oct 05 2004 - 18:02:55 GMT
Many thanks to Bill, who responded immediately.  It was indeed the
double-byte issue: Red Hat 9 uses en_US.UTF-8 as the default LANG.  I
changed that, and the problem went away.
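
For anyone wondering where a stray 'tml>' can come from, here is a small illustration of one plausible mechanism. It is an assumption, not something confirmed in this thread: if a document's length is reported in characters but the document is written out in bytes, a reader that trusts the reported length stops short, and the tail of "</html>" runs into the next record's headers. The toy document and the variable names below are made up for the demonstration.

```shell
#!/bin/sh
# Toy document: "<html>" + four 2-byte UTF-8 characters + "</html>".
# \303\251 is the UTF-8 encoding of "é" (an arbitrary example character).
doc=$(printf '<html>\303\251\303\251\303\251\303\251</html>')

# Length as a character count: 6 + 4 + 7 = 17.
# Hard-coded so the sketch does not depend on the current locale.
char_len=17

# Length as a byte count: 6 + 8 + 7 = 21.
# The arithmetic expansion strips any padding that wc may print.
byte_len=$(( $(printf '%s' "$doc" | wc -c) ))

# A reader that trusts char_len consumes only the first 17 bytes; whatever
# follows would be read as the start of the next record's headers.
leftover=$(printf '%s' "$doc" | tail -c +"$((char_len + 1))")

echo "bytes=$byte_len chars=$char_len leftover=$leftover"
# prints: bytes=21 chars=17 leftover=tml>
```

Four two-byte characters make the byte count exceed the character count by four, which matches the four leftover bytes in the warning quoted further down. That fits the symptom, but it does not prove this is what spider.pl actually did.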


-----Original Message-----
From: Bill Conlon [mailto:bill@tothept.com]
Sent: Tuesday, October 05, 2004 9:26 AM
To: mark@zaneray.com
Subject: Re: [SWISH-E] spider bug?


Ah, the dreaded double-byte character rears its ugly head.

On Red Hat, set the environment variable:
LANG=en_us
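
A minimal sketch of applying this to a single run instead of changing the system default, assuming a POSIX shell. The helper name `run_single_byte` is hypothetical, and using the `C` locale (always available, single-byte) in place of the value above is my own assumption. `LC_ALL` is set as well because it takes precedence over `LANG` when both are present.

```shell
#!/bin/sh
# Hypothetical helper: run one command under a single-byte locale,
# leaving the login environment untouched.
run_single_byte() {
    # env applies the settings only to the command it launches.
    env LC_ALL=C LANG=C "$@"
}

# Usage, with the command line from the original report:
#   run_single_byte swish-e -S prog -c e-caps.conf -v9
```

The override reaches spider.pl too, as long as swish-e starts it as a child process that inherits the environment.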

On Tuesday, October 5, 2004, at 09:20  AM, Mark Morgan wrote:

> I'm trying to index a client site, www.e-caps.com.  I'm using 2.5.2, and
> have tried 2.4.2, with the same results.  Some pages are OK, but one is
> confusing spider.pl.  I get:
>
> Parsing config file 'e-caps.conf'
> Indexing Data Source: "External-Program"
> Indexing "spider.pl"
> External Program found: /usr/local/lib/swish-e/spider.pl
> /usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'
> http://www.e-caps.com/za/ECP?PAGE=ABOUT_US - Using HTML2 parser -  (470 words)
> http://www.e-caps.com/za/ECP?PAGE=HOME - Using HTML2 parser -  (409 words)
> http://www.e-caps.com/za/ECP?PAGE=PRODUCTS_MAIN - Using HTML2 parser -  (140 words)
> http://www.e-caps.com/za/ECP?PAGE=KNOWLEDGE - Using HTML2 parser -  (387 words)
>
> Warning: Unknown header line: 'tml>Path-Name: http://www.e-caps.com/za/ECP?PAGE=FDA_DISCLAIMER&OMI=10093,10072&AMI=10093' from program spider.pl
> err: External program failed to return required headers Path-Name:
>
>
> The knowledge page passes HTML validation as far as structure goes, yet
> for some reason it leaves the spider with the extraneous 'tml>' string.
>
> My config is:
>
>     # Configuration file for spidering the e-caps site
>     # Use the "spider.pl" program included with Swish-e
>     IndexDir spider.pl
>
>     # Define what site to index
>     SwishProgParameters default http://www.e-caps.com/za/ECP?PAGE=ABOUT_US
>
> and the command is:
>
> swish-e  -S prog -c e-caps.conf -v9
>
>
>
> Other pages on the site, as you can see in the first few, go OK, but for
> some reason the knowledge page makes it blow chunks.  Anyone have any
> ideas?  If I run with -S http, it goes OK, but I need to use prog, as we
> have a bunch of PDF files that we want to index.
>
>
> |
> |  Mark Morgan
> |  Senior Programmer/Analyst
> |  T H E   Z A N E R A Y   G R O U P ,  I N C .
> |
> |  mark@zaneray.com
> |
> |  25 O'Brien Avenue
> |  Whitefish, MT 59937
> |  406.863.8000
> |
> |  http://www.zaneray.com
> |
Received on Tue Oct 5 11:03:07 2004