Thanks for your help. See below.
> Message: 6
> Date: Wed, 5 Mar 2008 06:11:42 -0800
> From: Bill Moseley <email@example.com>
> Subject: Re: [swish-e] Searching remote mail archive problem
> To: Swish-e Users Discussion List <firstname.lastname@example.org>
> Message-ID: <20080305141142.GA6428@hank.org>
> Content-Type: text/plain; charset=utf-8
> On Wed, Mar 05, 2008 at 08:03:06PM +0800, Tian Xinchun wrote:
> > Hi Peter,
> > I am sorry, but I cannot quite understand what you mean. Take an example:
> > $swish-e -c swish.conf -S prog
> > Indexing Data Source: "External-Program"
> > Indexing "spider.pl"
> > External Program found: /usr/local/lib/swish-e/spider.pl
> > /usr/local/lib/swish-e/spider.pl: Reading parameters from 'spider.conf'
> > https://www.lbl.gov/lists.archives/theta13-eng.archive/:1: error:
> > htmlParseStartTag: invalid element name
> > <?xml version="1.0" encoding="ISO-8859-1"?>
> > ^
> > https://www.lbl.gov/lists.archives/theta13-eng.archive/:2: error: Misplaced
> > DOCTYPE declaration
> > <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
> > ^
> You have two errors. That first one above is simply saying you are
> trying to index an xml document with Libxml's *html* parser.
> So you need to use the XML* parser type.
Actually, I have tried using XML*, but I still got the same error messages.
> > Warning: Unknown header line: 'ive/author.html' from program spider.pl
> > err: External program failed to return required headers Path-Name:
> What version of swish and spider.pl are you using?
> You can look at spider.pl in an editor and find:
> $VERSION = sprintf '%d.%02d', q$Revision: 1900 $ =~ /: (\d+)\.(\d+)/;
> The way -S prog works is that each file sent to swish has a byte
> count in the -S prog header. This is the size of the document in bytes.
> Once swish finds the blank line that indicates the end of the -S prog
> header (which defines the filename, length, and possibly date and
> parser type) it will read in the document in chunks until it reads in
> that byte count.
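To make sure I follow the description above, here is a rough Python sketch of the reading side. It is only an illustration, not swish-e's actual code: the function name read_records is made up, and I am assuming the length header is called Content-Length (Path-Name is the only header name that appears in the error output above).

```python
import io


def read_records(stream):
    """Yield (headers, body) pairs from a binary stream.

    Each record is: header lines, one blank line, then exactly
    Content-Length bytes of document body (header name assumed).
    """
    while True:
        line = stream.readline()
        if not line:
            return  # clean end of input
        headers = {}
        # Header lines run until the first blank line.
        while line.strip():
            name, sep, value = line.decode("latin-1").partition(":")
            if not sep:
                # Leftover body bytes end up here when a count was too small.
                raise ValueError(f"Unknown header line: {line!r}")
            headers[name.strip()] = value.strip()
            line = stream.readline()
        # Read the body by byte count, not by looking for a delimiter.
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


# A correct count keeps the reader in step with the record boundaries.
good = (b"Path-Name: a.html\nContent-Length: 5\n\nhello"
        b"Path-Name: b.html\nContent-Length: 2\n\nhi")
records = list(read_records(io.BytesIO(good)))

# A count that is too small leaves "lo" behind, which is then
# misread as a header line of the next record.
bad = (b"Path-Name: a.html\nContent-Length: 3\n\nhello\n"
       b"Path-Name: b.html\nContent-Length: 2\n\nhi")
try:
    list(read_records(io.BytesIO(bad)))
    desync_error = None
except ValueError as exc:
    desync_error = str(exc)
```

If that is right, it would explain why the tail of a document ('ive/author.html') shows up as an unknown header line.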
> When you get that "Unknown header line" it means that the byte count
> for a document was wrong. In this case, that typically means
> spider.pl is reporting an incorrect byte count for the file -- and
> in the past that has been caused by wide characters in the byte string.
> As far as I know, that's a problem with spider.pl -- because,
> regardless of the file's encoding (and even if reported incorrectly)
> it should be able to convert the character string into a byte string
> and tell you the correct length.
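For what it is worth, here is a small Python sketch of the mismatch described above, as I understand it. This is not spider.pl's code (which is Perl): make_record is a hypothetical helper, the sample text is invented, and Content-Length is again my assumption for the header name.

```python
def make_record(path, text, encoding="utf-8"):
    """Build one header/blank-line/body record with a byte-accurate length."""
    # Encode first, then measure: the count must be in bytes, not characters.
    body = text.encode(encoding)
    header = f"Path-Name: {path}\nContent-Length: {len(body)}\n\n"
    return header.encode("utf-8") + body


text = "θ13 résumé"                      # contains three non-ASCII characters
char_count = len(text)                   # counts characters
byte_count = len(text.encode("utf-8"))   # counts bytes after encoding

record = make_record("archive/author.html", text)
```

Here the character count is 10 but the UTF-8 byte count is 13, so a header carrying the character count would make the reader stop three bytes early.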
Thanks for the information. Is there any plan to fix it?
> Bill Moseley
> Users mailing list
> End of Users Digest, Vol 15, Issue 3
Dr. Xinchun Tian
Room A601, Mobile: 13426390768
Experimental Physics Center, IHEP, CAS
Received on Thu Mar 6 03:17:07 2008