Hi,
I'm using spider.pl to spider a few sites and pass the pages to swish-e,
which indexes their contents as well as some meta tags.
I'm using SWISH-E 2.2.3.
Here is my 'spider_config':
@servers = (
{
skip => 0,
debug => DEBUG_INFO | DEBUG_SKIPPED | DEBUG_LINKS |
DEBUG_FAILED | DEBUG_HEADERS,
base_url => 'http://www.somesite.com/catalog',
same_hosts => [],
agent => 'swish-e spider http://swish-e.org/',
email => 'nuno.ferreira@globalti.pt',
use_md5 => 1,
test_url => sub { $_[0]->path =~ /catalog/ },
test_response => sub {
my $content_type = $_[2]->content_type;
return $content_type =~ m!text/html!;
#my $ok = grep { $_ eq $content_type } qw{ text/html text/plain };
#return 1 if $ok;
#return;
},
delay_min => 0.01,
keep_alive => 1,
}
);
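The commented-out lines in test_response were meant to accept more than one content type. A standalone sketch of that check follows; the name type_ok and the %allowed hash are made up for illustration and are not part of spider.pl:

```perl
use strict;
use warnings;

# Hypothetical lookup table of the MIME types we want to index.
my %allowed = map { $_ => 1 } qw( text/html text/plain );

# Hypothetical helper: returns 1 if the content type is in the list, else 0.
sub type_ok {
    my ($content_type) = @_;
    return 0 unless defined $content_type;
    return $allowed{$content_type} ? 1 : 0;
}

print type_ok('text/html'),  "\n";
print type_ok('image/jpeg'), "\n";
```

Inside the config, the callback body would then be something like `return type_ok( $_[2]->content_type );`, assuming the helper is defined in the same file.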
And here is my 'swish-e.conf':
IndexReport 3
IndexFile /usr/local/swish/somesite.dat
IndexDir /usr/local/bin/spider.pl
MetaNames descricao sku keywords nomeproduto
SwishProgParameters /usr/local/swish/spider_config
I start the spidering/indexing like this:
# swish-e -c /path/to/swish-e.conf -S prog
It starts and it looks like it is doing everything I want, then it
suddenly crashes with:
<SNIP>
Looking at extracted tag '<td background="/images/verao_foo_d.jpg">'
! Found 0 links in http://www.somesite.com/catalog/formas.php?PHPSESSID=85c724f87fc7f0e68425e6454bb4e11d
http://www.somesite.com/catalog/detras_loja.php?PHPSESSID=85c724f87fc7f0e68425e6454bb4e11d - Using DEFAULT (HTML2) parser - (565 words)
err: External program failed to return required headers Path-Name: & Content-Length:
.
</SNIP>
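For context, as far as I understand the -S prog interface, swish-e expects each document from the external program to begin with a Path-Name: header and a Content-Length: header, followed by a blank line and then exactly that many bytes of content. The sketch below only illustrates that record layout; it is not spider.pl's actual code, and the helper name format_doc and the sample URL are made up:

```perl
use strict;
use warnings;

# Hypothetical helper: build one document record in the layout
# described above (headers, blank line, then the body).
sub format_doc {
    my ($path, $content) = @_;
    # Assumes $content is a byte string, so length() is a byte count.
    my $length = length $content;
    return "Path-Name: $path\n"
         . "Content-Length: $length\n"
         . "\n"
         . $content;
}

print format_doc(
    'http://www.somesite.com/catalog/index.php',
    "<html><body>hello</body></html>\n",
);
```

If anything other than such a record reaches swish-e on the program's standard output, or the byte count does not match the body, I would expect swish-e to lose track of where the next headers start.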
It always crashes in the same place. If I spider a different site, it
also crashes, again always in the same place.
I've found this thread <http://swish-e.org/archive/3817.html> that is
related to my problem, but after reading it I became even more confused:
because of the buffering issues, I may be looking at the wrong debug line.
Can anyone explain what is happening and, hopefully, post a solution?
TIA,
Nuno
Received on Mon Mar 31 13:42:07 2003