Re: [swish-e] Searching remote mail archive problem

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Mar 05 2008 - 14:11:42 GMT
On Wed, Mar 05, 2008 at 08:03:06PM +0800, Tian Xinchun wrote:
> Hi Peter,
> 
> I am sorry, but I do not quite understand what you mean. Here is an example:
> 
> $swish-e -c swish.conf -S prog
> Indexing Data Source: "External-Program"
> Indexing "spider.pl"
> External Program found: /usr/local/lib/swish-e/spider.pl
> /usr/local/lib/swish-e/spider.pl: Reading parameters from 'spider.conf'
> https://www.lbl.gov/lists.archives/theta13-eng.archive/:1: error:
> htmlParseStartTag: invalid element name
> <?xml version="1.0" encoding="ISO-8859-1"?>
>  ^
> https://www.lbl.gov/lists.archives/theta13-eng.archive/:2: error: Misplaced
> DOCTYPE declaration
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
> ^

You have two errors.  The first one, above, simply says that you are
trying to index an XML document with libxml's *html* parser, so you
need to use the XML* parser type.

> Warning: Unknown header line: 'ive/author.html' from program spider.pl
> err: External program failed to return required headers Path-Name:

What version of swish and spider.pl are you using?
You can look at spider.pl in an editor and find:

$VERSION = sprintf '%d.%02d', q$Revision: 1900 $ =~ /: (\d+)\.(\d+)/;

The way -S prog works is that each file sent to swish has a byte count
in the -S prog header.  This is the size of the document in bytes.
Once swish finds the blank line that indicates the end of the -S prog
header (which defines the filename, length, and possibly date and
parser type) it will read in the document in chunks until it reads in
that byte count.
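
To make the framing concrete, here is a rough Python sketch of a reader
that consumes a stream the way described above.  It is not swish's own
code.  Path-Name appears in the error message quoted earlier; the header
names Content-Length and Document-Type are how I recall the -S prog
format and should be treated as assumptions, as should the function
name read_documents.

```python
import io


def read_documents(stream):
    """Yield (headers, body) pairs from a binary -S prog style stream.

    Illustrative only: header names other than Path-Name are assumed.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of input, no more documents
        headers = {}
        # Header lines run until the first blank line.
        while line not in (b"\n", b"\r\n", b""):
            name, sep, value = line.decode("latin-1").partition(":")
            if not sep:
                raise ValueError(f"Unknown header line: {line!r}")
            headers[name.strip()] = value.strip()
            line = stream.readline()
        if "Path-Name" not in headers:
            raise ValueError("missing required header Path-Name")
        # The body is exactly Content-Length bytes, not characters.
        size = int(headers["Content-Length"])
        body = stream.read(size)
        yield headers, body


sample = io.BytesIO(
    b"Path-Name: a.html\nContent-Length: 5\n\nhello"
    b"Path-Name: b.xml\nContent-Length: 3\nDocument-Type: XML*\n\nabc"
)
for headers, body in read_documents(sample):
    print(headers["Path-Name"], len(body))
```

Note that nothing separates one document from the next except the byte
count, which is why a wrong count derails everything after it.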

When you get that "Unknown header line" it means that the byte count
for a document was wrong, so swish ends up reading leftover document
content as if it were the next header.  In this case that typically
means spider.pl is reporting an incorrect count of bytes in the file,
and in the past that has been due to wide characters in the string.
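
A small Python sketch of that failure mode, assuming the count was taken
in characters while the document is sent as UTF-8 bytes.  The function
name and the sample text are made up for illustration.

```python
def leftover_after_read(text, reported_length, encoding="utf-8"):
    """Return the bytes still unread after consuming reported_length bytes.

    Hypothetical helper: it mimics a reader that trusts the reported
    count and stops there, leaving the rest of the body in the stream.
    """
    raw = text.encode(encoding)
    return raw[reported_length:]


text = "na\u00efve"                 # 5 characters, one of them non-ASCII
char_count = len(text)              # what a character-based length reports
byte_count = len(text.encode("utf-8"))  # what is actually sent

print(char_count, byte_count)
print(leftover_after_read(text, char_count))
```

The leftover bytes are what the reader would then try to parse as a
header line, which is consistent with a warning that shows a fragment of
document content such as 'ive/author.html'.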

As far as I know, that's a problem with spider.pl -- because,
regardless of the file's encoding (and even if reported incorrectly)
it should be able to convert the character string into a byte string
and tell you the correct length.
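
As a sketch of that idea (in Python rather than spider.pl's Perl, and
not the actual spider.pl code): encode first, then measure the encoded
bytes.  The function name format_document and the header names
Content-Length and Document-Type are assumptions for illustration.

```python
def format_document(path, text, encoding="utf-8", parser=None):
    """Build one -S prog style record with a byte-accurate length.

    Hypothetical helper: the length is taken from the encoded bytes,
    never from the character string.
    """
    body = text.encode(encoding)  # character string -> byte string
    header = [f"Path-Name: {path}", f"Content-Length: {len(body)}"]
    if parser:
        header.append(f"Document-Type: {parser}")
    # Headers end with a blank line; the body follows immediately.
    return ("\n".join(header) + "\n\n").encode("utf-8") + body


record = format_document("author.html", "na\u00efve", parser="XML*")
print(record)
```

Whatever encoding is chosen, the count and the bytes written come from
the same encoded value, so they cannot disagree.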

-- 
Bill Moseley
moseley@hank.org

Received on Wed Mar 5 09:11:44 2008