Re: URL-fixing with callback routines for spider.pl

From: Bill Conlon <bill(at)not-real.tothept.com>
Date: Thu May 19 2005 - 17:04:26 GMT
Have you turned on debugging and compared the response headers that 
spider.pl sees with those that a 'normal' user agent gets?

Maybe your server is configured to respond differently to certain user 
agents?
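
One way to make that comparison outside the spider is a small LWP script. This is only a rough sketch: the two agent strings are placeholders (use whatever your spider config and your browser actually send), and headers_for and differing_headers are made-up helper names, not part of spider.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fetch $url while identifying as $agent and return a hash reference of
# response headers. LWP::UserAgent is loaded on first use.
sub headers_for {
    my ($url, $agent) = @_;
    require LWP::UserAgent;
    my $ua  = LWP::UserAgent->new(agent => $agent, timeout => 30);
    my $res = $ua->get($url);
    my %headers = map { ($_ => scalar $res->header($_)) } $res->header_field_names;
    $headers{'(status)'} = $res->status_line;
    return \%headers;
}

# Return the sorted names of headers whose values differ between two
# header hashes (a header missing on one side counts as different).
sub differing_headers {
    my ($x, $y) = @_;
    my %names = map { ($_ => 1) } keys %$x, keys %$y;
    my @diff  = sort grep { ($x->{$_} // '') ne ($y->{$_} // '') } keys %names;
    return @diff;
}

# Usage: perl compare_headers.pl http://my.host/default.aspx
if (my $url = shift @ARGV) {
    my $as_spider  = headers_for($url, 'swish-e spider');  # placeholder agent string
    my $as_browser = headers_for($url, 'Mozilla/5.0');     # placeholder agent string
    for my $name (differing_headers($as_spider, $as_browser)) {
        printf "%s\n  spider : %s\n  browser: %s\n", $name,
            $as_spider->{$name}  // '(absent)',
            $as_browser->{$name} // '(absent)';
    }
}
```

Headers such as Date will normally differ between any two requests, so the interesting ones are things like Set-Cookie, Location, or the status line.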

bill
On Thursday, May 19, 2005, at 09:07  AM, koszalekopalek wrote:

> koszalekopalek wrote:
>> Bill Moseley wrote:
>>
>>> If I remember correctly, the %visited hash gets set when extracting
>>> links, so it's not easy to do what you are trying.
>>
>>
>> Ok, I whipped this up. The %bogus_visited hash is populated in
>> test_url subroutine. The spider is running now. Do you think
>> it will work?
>
>
> Looks like it worked :-) At least the spider is
> not spidering any more.
>
> Btw, any pointers on why the server is not happy
> with use_cookies => 1, ?
>
> A.
>
>
>
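>> sub dbg {
>> 	# Append a message to the debug log; warn and carry on if it cannot be opened.
>> 	open(my $fh, '>>', '__dbg.log') or do { warn "Cannot open __dbg.log: $!"; return };
>> 	print {$fh} $_[0];
>> 	close($fh);
>> }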
>>
>> # Normalized URLs already seen. Declared outside the sub so that it
>> # persists between calls.
>> my %bogus_visited;
>>
>> sub my_test_url {
>> 	my $uri = shift;
>> 	my $path = $uri->path;
>> 	my $url = $uri->canonical->as_string;
>> 	
>> 	# skip images
>> 	return 0 if $path =~ m{\.(gif|png|jpeg|jpg)$}i;
>> 	# skip archives
>> 	return 0 if $path =~ m{\.(zip|gz|tgz|tar)$}i;
>> 	
>> 	# hash for bogus urls
>> 	# change   http://my.host/(A(AcWS....4PGw2))/default.aspx
>> 	# to       http://my.host/(__bogus__)/default.aspx
>> 	if ($url =~ s{/\(A\(.*?\)\)}{/(__bogus__)}) {
>> 		if ($bogus_visited{$url}) {
>> 			dbg ("BOGUS (duplicate): $url\n");
>> 			return 0;
>> 		} else {
>> 			dbg ("BOGUS (new): $url\n");
>> 			$bogus_visited{$url} = 1;
>> 		}
>> 	}
>> 	return 1;
>> }
>>
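
To check the normalization idea without running the spider, it can be tried on plain strings. This is a minimal sketch under assumptions: the three URLs and their tokens are invented for illustration, and normalize_session_url and first_visit are hypothetical names, not anything spider.pl provides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Seen-set shared across calls; it lives outside the subs so it persists.
my %seen;

# Replace a path token of the form /(A(...)) with a fixed placeholder so
# that URLs differing only in that token compare equal.
sub normalize_session_url {
    my ($url) = @_;
    (my $key = $url) =~ s{/\(A\(.*?\)\)}{/(__bogus__)};
    return $key;
}

# Return 1 the first time a normalized URL is seen, 0 on every repeat.
sub first_visit {
    my ($url) = @_;
    return $seen{ normalize_session_url($url) }++ ? 0 : 1;
}

# Made-up example URLs: the first two differ only in the token.
for my $url (
    'http://my.host/(A(token1))/default.aspx',
    'http://my.host/(A(token2))/default.aspx',
    'http://my.host/(A(token3))/other.aspx',
) {
    printf "%-42s %s\n", $url, first_visit($url) ? 'fetch' : 'skip';
}
```

The second URL is reported as a skip because it normalizes to the same key as the first; the third has a different page name and is fetched.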
Received on Thu May 19 10:04:30 2005