On Mon, 31 Mar 2003, Nuno Ferreira wrote:
> It starts and it looks like it is doing everything I want, then it
> suddenly crashes with:
> Looking at extracted tag '<td background="/images/verao_foo_d.jpg">'
> ! Found 0 links in
> e68425e6454bb4e11d - Using DEFAULT (HTML2) parser - (565 words)
> err: External program failed to return required headers Path-Name: &
> It always crashes in the same place. If I spider a different site, it
> crashes also and always in the same place.
> I've found this thread <http://swish-e.org/archive/3817.html> that is
> related to my problem but after reading it, I became even more confused
> because now I know that I may be looking at the wrong debug line because
> of the buffering issues.
First, see if this is a possible fix:
If you set debug => DEBUG_URL then it will display the URLs as they are
fetched and before swish gets the document. That should help find the
exact document where the problem is happening.
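For reference, a minimal sketch of where that setting would go in a spider
config file. The base_url and email values are placeholders, not taken from
your setup, and this assumes the DEBUG_URL constant is supplied by spider.pl
when it loads the config:

```perl
# Spider config sketch: meant to be loaded by spider.pl, not run on its own.
# base_url and email below are hypothetical placeholders.
@servers = (
    {
        base_url => 'http://www.example.com/',
        email    => 'you@example.com',
        debug    => DEBUG_URL,    # print each URL as it is fetched
    },
);

1;
```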
But that error "failed to return required headers" is likely due to the
*previous* document returning the wrong content length. The way extprog
works is it reads line-by-line to read the headers. Then when it sees a
blank line (that marks the end of the headers) it reads content-length
bytes in from the external program and starts over.
If that content length is one byte short, and the last byte of the doc is a
\n, then when it starts to read the next doc it will see just \n and assume
that's the end of the headers. But at that point no Content-Length or
Path-Name header is set, so the program aborts with that error.
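To make the failure concrete, here is a rough Perl sketch of that read loop.
It is not swish-e's actual code; read_docs and record are made-up names, and
the stream is just an in-memory string. It only illustrates how a length that
is one byte short leaves a stray \n that is then taken as an empty header
block:

```perl
use strict;
use warnings;

# Read "headers, blank line, body" records as described above.
# Returns a list of [path, body] pairs, or dies if a header block
# ends without both Path-Name and Content-Length.
sub read_docs {
    my ($stream) = @_;
    open my $fh, '<', \$stream or die "cannot open in-memory stream: $!";
    binmode $fh;
    my @docs;
    until (eof($fh)) {
        my %h;
        while (defined(my $line = <$fh>)) {
            last if $line eq "\n";    # blank line ends the headers
            $h{lc $1} = $2 if $line =~ /^([\w-]+):\s*(.*)$/;
        }
        die "failed to return required headers\n"
            unless defined $h{'path-name'} && defined $h{'content-length'};
        read($fh, my $body, $h{'content-length'});
        push @docs, [ $h{'path-name'}, $body ];
    }
    return @docs;
}

# Build one record with whatever Content-Length the caller claims.
sub record {
    my ($path, $body, $claimed_len) = @_;
    return "Path-Name: $path\nContent-Length: $claimed_len\n\n$body";
}

my $body = "hello\n";    # 6 bytes, last byte is a newline

# Correct lengths: both documents are read.
my @good = read_docs(record('a', $body, 6) . record('b', $body, 6));
print scalar(@good), " documents read\n";

# First length is one byte short: the leftover "\n" looks like the end
# of an empty header block for the next document.
my $survived = eval { read_docs(record('a', $body, 5) . record('b', $body, 6)); 1 };
print "error: $@" unless $survived;
```

Note that the error is raised while reading the second document even though
the first one carried the bad length.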
I suspect what is happening is that the previous document has a wide char
that is forcing perl into UTF-8 encoding. spider.pl is using "length" to
determine the length of the string, but that's the character length, not
the byte length:
$ perl -MDevel::Peek -e '$x=chr(400);Dump($x);print "len=", length$x, "\n"'
SV = PV(0x80f6344) at 0x80fd2a4
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x80f9e58 "\306\220"\0
  CUR = 2
  LEN = 3
So the length of the string is two bytes, but "length" is returning one.
That would result in your problem.
I need to find a way to get the correct byte length that is portable across
all versions of Perl.
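One candidate, as a sketch only: the bytes pragma makes length() count bytes
within its lexical scope. This assumes a perl recent enough to ship the
pragma (5.6 and later, as far as I know), so it is not an answer for older
versions. byte_length is a made-up helper name, not something in spider.pl:

```perl
use strict;
use warnings;

# Return the number of bytes in the string's internal representation,
# whether or not the UTF8 flag is set.
sub byte_length {
    my ($str) = @_;
    use bytes;    # lexically scoped: length() counts bytes in this block
    return length $str;
}

my $x = chr(400);    # same wide char as in the Devel::Peek dump above
printf "chars=%d bytes=%d\n", length($x), byte_length($x);
```

The byte count is only the right Content-Length if the output handle writes
those same bytes unchanged, so that would need checking as well.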
Bill Moseley email@example.com
Received on Mon Mar 31 14:39:40 2003