Re: [swish-e] (no subject)

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Mon Jun 15 2009 - 02:07:49 GMT
Ullas wrote on 6/9/09 11:58 PM:
> Thanks for the reply ...
> 
> the output immediately before the error is:
> 
> //////////////////////////////////////////////////////////////////////////////////////////
> 
> http://www.admiralmotorinn.com.au/index.php?pageid=4080 - Using HTML2
> parser -  (262 words)
> http://www.admiralmotorinn.com.au/index.php?pageid=3953 - Using HTML2
> parser -  (177 words)
> http://www.admiralmotorinn.com.au/index.php?pageid=4115 - Using HTML2
> parser -  (185 words)
> 
> Summary for: http://www.admiralmotorinn.com.au/
>      Connection: Close:      1  (0.5/sec)
> Connection: Keep-Alive:     12  (6.0/sec)
>             Duplicates:    149  (74.5/sec)
>         Off-site links:     74  (37.0/sec)
>            Total Bytes: 92,056  (46028.0/sec)
>             Total Docs:     13  (6.5/sec)
>            Unique URLs:     13  (6.5/sec)
> http://www.admiralmotorinn.com.au/index.php?pageid=3746 - Using HTML2
> parser -  (193 words)
> 
> Warning: External program returned zero Content-Length when processing
> file'http://www.admiralmotorinn.com.au/index.php?pageid=3746'
> http://www.admiralmotorinn.com.au/index.php?pageid=3746 - Using DEFAULT
> (HTML2) parser -  (no words indexed)
> err: External program failed to return required headers Path-Name:
> .
> 
> //////////////////////////////////////////////////////////////////////////////////////////
> 


Some of the URLs that spider.pl extracts seem to have escaped characters at
the end, likely because you have an href like:

 href="http://something  "

so the extra spaces get URL-escaped. Or perhaps they are encoded that way
already. In any case, there are multiple links to the problem URL with values like:

 http://www.admiralmotorinn.com.au/index.php?pageid=3746%0A%20%20

and those extra characters at the end are having two effects: (1) the same page
is being fetched multiple times, once per variant of the URL, and (2) the extra
space throws off the length() check by one byte.

I'm not sure whether this is a spider.pl bug, but adding this to your
test_url() callback fixed your particular problem:

 return 0 if $uri =~ m/\%20/;
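
For context, here is a rough sketch of where that line could sit in a spider
config file. The base_url comes from your log output; the email value, the
callback layout, and the stricter regex (which only rejects URLs that END in
escaped whitespace, so a legitimate %20 in the middle of a URL still gets
through) are my assumptions, not something I tested against your site:

```perl
# SwishSpiderConfig.pl (sketch only; adjust to your existing config)
our @servers = (
    {
        base_url => 'http://www.admiralmotorinn.com.au/',
        email    => 'you@example.com',    # placeholder address (assumption)

        # Called for each extracted link; return false to skip the URL.
        test_url => sub {
            my ( $uri, $server ) = @_;

            # Assumption: reject links whose string form ends in
            # URL-escaped whitespace (space, tab, LF, CR), such as
            # ...?pageid=3746%0A%20%20
            return 0 if "$uri" =~ m/(?:%20|%09|%0A|%0D)+\z/i;

            return 1;    # spider everything else
        },
    },
);

1;    # config file must return a true value
```

The simpler m/\%20/ test above is what I actually ran; the anchored version is
only worth the trouble if some of your real URLs contain escaped spaces.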

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Sun Jun 14 22:07:47 2009