RE: External program failed to return required headers Path-Name: & Content-Length:

From: Nuno Ferreira <nuno.ferreira(at)not-real.globalti.pt>
Date: Tue Apr 01 2003 - 09:55:18 GMT
Hi,

Exactly the same happens *with* your patch...

> -----Original Message-----
> From: Bill Moseley [mailto:moseley@hank.org] 
> Sent: Monday, March 31, 2003 20:24
> To: Nuno Ferreira
> Cc: 'Multiple recipients of list'
> Subject: RE: [SWISH-E] External program failed to return 
> required headers Path-Name: & Content-Length:
> 
> 
> On Mon, 31 Mar 2003, Nuno Ferreira wrote:
> 
> > I am not the sysadmin of the remote sites. I'll try to speak to them.
> > I can test any patch that you want me to try.
> 
> You can just make a local copy of spider.pl, so you shouldn't need the
> help of the sysadmin.
> 
> Then, as long as you are running something like Perl 5.6.1 or newer, look
> in spider.pl for:
> 
>     my $headers = join "\n",
>         'Path-Name: ' .  $uri,
>         'Content-Length: ' . length $$content,
>         '';
> 
> and replace it with something like:
> 
>     my $doc_length = do { use bytes; length $$content };
> 
>     my $headers = join "\n",
>         'Path-Name: ' .  $uri,
>         'Content-Length: ' . $doc_length,
>         '';
> 
> I suppose you might even be able to just place:
> 
>    use bytes;
> 
> toward the top of spider.pl and it would work, too.  But there might be
> some other side-effects so the above might be a safer fix for now.
> 
> > 
> > Regards,
> > Nuno
> > 
> > > -----Original Message-----
> > > From: Bill Moseley [mailto:moseley@hank.org] 
> > > Sent: Monday, March 31, 2003 15:35
> > > To: Nuno Ferreira
> > > Cc: Multiple recipients of list
> > > Subject: Re: [SWISH-E] External program failed to return 
> > > required headers Path-Name: & Content-Length:
> > > 
> > > 
> > > On Mon, 31 Mar 2003, Nuno Ferreira wrote:
> > > 
> > > > It starts and it looks like it is doing everything I want, then it
> > > > suddenly crashes with:
> > > > <SNIP>
> > > > Looking at extracted tag '<td background="/images/verao_foo_d.jpg">'
> > > > ! Found 0 links in
> > > > http://www.somesite.com/catalog/formas.php?PHPSESSID=85c724f87fc7f0e68425e6454bb4e11d
> > > > 
> > > > http://www.somesite.com/catalog/detras_loja.php?PHPSESSID=85c724f87fc7f0e68425e6454bb4e11d - Using DEFAULT (HTML2) parser -  (565 words)
> > > > err: External program failed to return required headers Path-Name: &
> > > > Content-Length:
> > > > .
> > > > </SNIP>
> > > > 
> > > > It always crashes in the same place. If I spider a different site, it
> > > > also crashes, and always in the same place.
> > > > I've found this thread <http://swish-e.org/archive/3817.html> that is
> > > > related to my problem but after reading it, I became even more confused
> > > > because now I know that I may be looking at the wrong debug line because
> > > > of the buffering issues.
> > > 
> > > First, see if this is a possible fix:
> > > 
> >   http://swish-e.org/archive/4870.html
> > 
> > 
> > If you set debug => DEBUG_URL then it will display the URLs as they are
> > fetched and before swish gets the document.  That should help find the
> > exact document where the problem is happening.
> > 
> > But that error "failed to return required headers" is likely due to the
> > *previous* document returning the wrong content length.  The way extprog
> > works is it reads line-by-line to read the headers.  Then when it sees a
> > blank line (that marks the end of the headers) it reads content-length
> > bytes in from the external program and starts over.
> > 
> > If that content length was one byte short, and the last byte of the doc
> > is a \n, then when it starts to read the next doc it will see just \n and
> > assume that's the end of the headers.  But at that point no Content-Length
> > or Path-Name header is set so the program aborts with that error.
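
A minimal sketch of the framing described above, to make the failure concrete. It is not the swish-e source: read_docs and frame are hypothetical names, and the reader is only a stand-in that parses headers up to a blank line and then consumes exactly Content-Length bytes. It assumes Perl 5.8 or newer for the in-memory filehandle.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the reader: parse header lines up to a blank
# line, then consume exactly Content-Length bytes, then start over.
sub read_docs {
    my ($stream) = @_;
    open my $fh, '<:raw', \$stream or die "cannot open in-memory stream: $!";
    my @docs;
    until (eof $fh) {
        my %h;
        while (defined(my $line = <$fh>)) {
            chomp $line;
            last if $line eq '';    # a blank line ends the headers
            my ($name, $value) = split /:\s*/, $line, 2;
            $h{$name} = $value;
        }
        die "failed to return required headers Path-Name: & Content-Length:\n"
            unless defined $h{'Path-Name'} && defined $h{'Content-Length'};
        read $fh, my $body, $h{'Content-Length'};
        push @docs, { path => $h{'Path-Name'}, body => $body };
    }
    return @docs;
}

# Hypothetical helper: build one document with a caller-supplied length claim.
sub frame {
    my ($path, $body, $claimed_length) = @_;
    return "Path-Name: $path\nContent-Length: $claimed_length\n\n$body";
}

my $body = "hello\n";    # 6 bytes, ends in a newline

# Correct lengths: both documents parse.
my @ok = read_docs(frame('a', $body, 6) . frame('b', $body, 6));
print scalar(@ok), " documents read\n";

# First length is one byte short: the leftover "\n" looks like an empty
# header block for the next document, so the reader aborts.
my $err = eval { read_docs(frame('a', $body, 5) . frame('b', $body, 6)); 1 }
    ? 'no error'
    : $@;
print "short length: $err";
```

The error is raised while reading the second document even though the wrong length belongs to the first, which is why the last URL in the debug output is not necessarily the culprit.
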
> > 
> > I suspect what is happening is that the previous document has a wide char
> > and is forcing perl into UTF-8 encoding.  spider.pl is using "length" to
> > determine the length of the string, but that's the character length, not
> > the byte length:
> > 
> > $ perl -MDevel::Peek -e '$x=chr(400);Dump($x);print "len=", length$x, "\n"'
> > SV = PV(0x80f6344) at 0x80fd2a4
> >   REFCNT = 1
> >   FLAGS = (POK,pPOK,UTF8)
> >   PV = 0x80f9e58 "\306\220"\0
> >   CUR = 2
> >   LEN = 3
> > len=1
> > 
> > So the length of the string is two bytes, but "length" is returning one.
> > That would result in your problem.
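
For comparison, a small sketch of two ways to get a byte count for the same wide character. The helper name utf8_byte_length is hypothetical and not part of spider.pl. The bytes pragma is available from Perl 5.6; the Encode module ships with Perl 5.8 and later, so on 5.6.1 only the pragma form would apply.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: number of bytes the string occupies once it is
# explicitly encoded as UTF-8.
sub utf8_byte_length {
    my ($text) = @_;
    return length Encode::encode_utf8($text);
}

my $x = chr(400);    # same wide character as in the Devel::Peek dump

my $chars        = length $x;                      # character count
my $bytes_pragma = do { use bytes; length $x };    # bytes in the internal buffer
my $bytes_encode = utf8_byte_length($x);           # bytes after explicit encoding

print "chars=$chars bytes(pragma)=$bytes_pragma bytes(encode)=$bytes_encode\n";
```

The two byte counts agree here, but they can differ for a string that holds high-bit characters without the UTF8 flag set, so which one is right depends on how the content is actually written to the output.
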
> > 
> > I need to find a portable way for use with all versions of Perl to read
> > the correct byte length.
> > 
> > 
> > 
> 
> -- 
> Bill Moseley moseley@hank.org
> 
> 
> 
Received on Tue Apr 1 09:56:34 2003