Re: [swish-e] Searching remote mail archive problem

From: Tian Xinchun <tianxc(at)not-real.ihep.ac.cn>
Date: Wed Mar 05 2008 - 12:03:06 GMT
Hi Peter,

I am sorry, but I do not quite understand what you mean. Take an example:

$swish-e -c swish.conf -S prog
Indexing Data Source: "External-Program"
Indexing "spider.pl"
External Program found: /usr/local/lib/swish-e/spider.pl
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'spider.conf'
https://www.lbl.gov/lists.archives/theta13-eng.archive/:1: error:
htmlParseStartTag: invalid element name
<?xml version="1.0" encoding="ISO-8859-1"?>
 ^
https://www.lbl.gov/lists.archives/theta13-eng.archive/:2: error: Misplaced
DOCTYPE declaration
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
^

Warning: Unknown header line: 'ive/author.html' from program spider.pl
err: External program failed to return required headers Path-Name:
.

As you pointed out, what appears just before the first doc that reports the
"Unknown header line" is:
===========================================================================
https://www.lbl.gov/lists.archives/theta13-eng.archive/:2: error: Misplaced
DOCTYPE declaration
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
^
===========================================================================
But I have indexed other hypermail archives successfully even though this kind
of error message appeared. Does that mean that the hypermail archive on the
remote server was not generated correctly, or something else?

Thanks.

Best Regards,

Xinchun


Xinchun Tian wrote on 3/2/08 7:38 AM:
> Hi Peter,
> Thanks for your help, but the problem is still not resolved.
> Similar errors also include:
> When indexing https://www.lbl.gov/lists.archives/theta13-offline.archive/:
> Warning: Unknown header line:
> 'https://www.lbl.gov/lists.archives/theta13-offline.archive/author.html' from
> program spider.pl
> err: External program failed to return required headers Path-Name:
> or, for https://www.lbl.gov/lists.archives/theta13-eng.archive/:
> Warning: Unknown header line: 'ive/author.html' from program spider.pl
> err: External program failed to return required headers Path-Name:
> and other similar error messages. It seems to me that spider.pl does not
> parse the hypermail archive correctly. Any help?

The issue is that one doc breaks the indexer's sense of content length,
and swish-e can't recover its place afterwards. Often this is a case of the
encoding not being reported correctly, but it can also be other issues.

Find the first doc that reports the 'Unknown header line' and then look at the
doc that was indexed just before it. The one before the errors start is your
culprit.
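
To illustrate the failure mode described above, here is a minimal Python
sketch of a length-prefixed document stream. It is a toy model, not swish-e's
actual reader or spider.pl's actual writer. The names make_record and
parse_stream, the example.org URLs, and the declared_length parameter are all
hypothetical. The only things taken from swish-e are the header names:
Path-Name appears in the log above, and Content-Length is assumed here to be
the byte-count header that "-S prog" input uses. The sketch shows that when
one record declares the wrong length, the reader resumes in the middle of the
next record's headers and reports a fragment as an unknown header line.

```python
def make_record(path, text, encoding="utf-8", declared_length=None):
    """Build one record: headers, a blank line, then the encoded body.

    declared_length is a hypothetical knob used only to simulate a
    producer that reports the wrong size.
    """
    body = text.encode(encoding)
    # The length must count bytes, not characters.
    length = len(body) if declared_length is None else declared_length
    header = "Path-Name: %s\nContent-Length: %d\n\n" % (path, length)
    return header.encode("ascii") + body


def parse_stream(stream):
    """Toy reader: trust Content-Length to find where each body ends."""
    known = (b"Path-Name", b"Content-Length")
    docs = []
    pos = 0
    while pos < len(stream):
        header_end = stream.find(b"\n\n", pos)
        if header_end == -1:
            header_end = len(stream)
        headers = {}
        for line in stream[pos:header_end].split(b"\n"):
            name, sep, value = line.partition(b": ")
            if not sep or name not in known:
                raise ValueError(
                    "Unknown header line: %r" % line.decode("latin-1"))
            headers[name] = value
        if b"Path-Name" not in headers or b"Content-Length" not in headers:
            raise ValueError("required headers missing")
        length = int(headers[b"Content-Length"])
        body_start = header_end + 2
        docs.append((headers[b"Path-Name"].decode("ascii"),
                     stream[body_start:body_start + length]))
        # A wrong length puts pos in the wrong place for the next record.
        pos = body_start + length
    return docs


if __name__ == "__main__":
    second = make_record("https://example.org/archive/author.html",
                         "<html>two</html>")
    good = make_record("https://example.org/archive/", "<html>one</html>")
    print(len(parse_stream(good + second)), "docs parsed")

    # Over-report the first body by 10 bytes: the reader swallows the
    # start of the next record and then fails on the leftover fragment.
    bad = make_record("https://example.org/archive/", "<html>one</html>",
                      declared_length=26)
    try:
        parse_stream(bad + second)
    except ValueError as exc:
        print(exc)
```

In this model the record that triggers the error is not the one that is named
in the message. The damage is done by the record immediately before it, which
is why the advice is to inspect the doc indexed just before the first warning.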

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users



--------------------------------------------------------------------------------

Xinchun Tian
2008-03-05
Received on Wed Mar 5 07:04:05 2008