On Oct 26, 2010, at 10:09 PM, Peter Karman wrote:
> Troy Wical wrote on 10/26/10 11:06 PM:
>> Thanks for that. It's not the first time you've mentioned the problems with having modules installed from different locations. I edited spider.pl to point to the CPAN version and the errors are gone. However, after it runs for a couple of minutes I now get the following. I don't believe it is caused by the page being crawled, though I've been wrong before.
>>
>> #############################################
>> Warning: Unknown header line: 'ath-Name: http://type2.com/ezmlm-archives/index.cgi?list=type2&cmd=monthbydate&month=201009' from program spider.pl
>> err: External program failed to return required headers Path-Name:
>> #############################################
>>
>
> that sounds like an encoding issue. The problem happens when the length reported
> in the previous document != the actual document length, and the leading 'P' gets
> read as part of the previous document.
>
> Turn on the spider.pl debugging verbosity to see each URL, and check the
> accuracy of the encoding and document length of the URI *before*
>
> http://type2.com/ezmlm-archives/index.cgi?list=type2&cmd=monthbydate&month=201009
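
To make the length mismatch described above concrete, here is a minimal Python sketch. It is not swish-e's actual parser. The Path-Name header appears in the error above and Content-Length is assumed to be the header carrying the reported length; `frame` and `read_records` are hypothetical helpers written only for this illustration. It assumes the length was computed from the UTF-8 bytes while the body was written as Latin-1, which is just one way the two numbers could disagree.

```python
def frame(path, body_bytes, reported_length):
    """Build one document record: headers, a blank line, then the body."""
    header = f"Path-Name: {path}\nContent-Length: {reported_length}\n\n"
    return header.encode("ascii") + body_bytes


def read_records(stream):
    """Read records by trusting the reported length of each body.

    Returns a list of (path, body) tuples. Raises ValueError on the
    first header line whose name is not recognised.
    """
    records = []
    pos = 0
    while pos < len(stream):
        headers = {}
        while True:
            end = stream.index(b"\n", pos)
            line = stream[pos:end].decode("ascii")
            pos = end + 1
            if line == "":
                break  # blank line ends the header block
            name, _, value = line.partition(": ")
            if name not in ("Path-Name", "Content-Length"):
                raise ValueError(f"Unknown header line: '{line}'")
            headers[name] = value
        length = int(headers["Content-Length"])
        records.append((headers["Path-Name"], stream[pos:pos + length]))
        pos += length  # the next header is assumed to start here
    return records


text = "caf\u00e9"  # 4 characters, 5 bytes in UTF-8, 4 bytes in Latin-1
utf8_length = len(text.encode("utf-8"))

# Reported length matches the bytes actually written.
good = frame("/a", text.encode("utf-8"), utf8_length) + frame("/b", b"hello", 5)

# Reported length is one byte larger than what was actually written.
bad = frame("/a", text.encode("latin-1"), utf8_length) + frame("/b", b"hello", 5)

if __name__ == "__main__":
    print(read_records(good))
    try:
        read_records(bad)
    except ValueError as err:
        print(err)
```

In the `bad` stream the reader takes one byte too many for the first body, so the "P" of the next Path-Name header is swallowed and the following header line starts with "ath-Name", which is the same shape as the warning above.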
I was using the spider.pl defaults, so I created a config file for the spider with the following contents:
###########################################
[root@purple /home/search]# more t2.spider.config
@servers = (
    {
        base_url           => 'http://type2.com/ezmlm-archives/index.cgi?list=type2',
        use_default_config => 1,
        SPIDER_QUIET       => 1,
        email              => 'troy@wical.com',
        delay_sec          => '0',
        max_depth          => '3',
        keep_alive         => '1',
        errors             => '1',
        failed             => '1',
    },
);
##########################################
I have a similar config file working elsewhere, but perhaps it too has issues I didn't know about, since I am now getting the following errors:
##########################################
[root@purple /home/search]# swish-e -c /home/search/t2.conf -S prog
Indexing Data Source: "External-Program"
Indexing "spider.pl"
External Program found: /usr/local/lib/swish-e/spider.pl
/usr/local/lib/swish-e/spider.pl: ** Warning: config option [errors] is unknown. Perhaps misspelled?
/usr/local/lib/swish-e/spider.pl: ** Warning: config option [SPIDER_QUIET] is unknown. Perhaps misspelled?
/usr/local/lib/swish-e/spider.pl: ** Warning: config option [failed] is unknown. Perhaps misspelled?
/usr/local/lib/swish-e/spider.pl: Reading parameters from 't2.spider.config'
http://type2.com/ezmlm-archives/index.cgi?list=type2:7: error: htmlParseEntityRef: expecting ';'
" title="RSS 2.0" href="http://type2.com/ezmlm-archives/index.cgi?list=type2&cmd
^
http://type2.com/ezmlm-archives/index.cgi?list=type2:7: error: htmlParseEntityRef: expecting ';'
.0" href="http://type2.com/ezmlm-archives/index.cgi?list=type2&cmd=feed&feedtype
^
http://type2.com/ezmlm-archives/index.cgi?list=type2:8: error: htmlParseEntityRef: expecting ';'
title="Atom 0.3" href="http://type2.com/ezmlm-archives/index.cgi?list=type2&cmd
^
###########################################
Perhaps there are syntax issues in the config file. I will try to work this out before getting back to the encoding issue.
Troy
Received on Wed Oct 27 10:46:03 2010