Have you turned on debugging and compared the response headers that
spider.pl sees with those that a 'normal' user agent gets?
Maybe your server is configured to respond differently to certain user
agents?
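
One way to make that comparison outside the spider is a small LWP
script. The following is only a rough sketch, not part of spider.pl:
the helper names fetch_headers and diff_headers are made up for this
example, and both agent strings are placeholders that you would replace
with the agent value from your spider config and with your browser's
User-Agent string. It assumes LWP::UserAgent is installed and keeps
cookies enabled for both requests, so any difference should come from
the agent string alone.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fetch $url with the given User-Agent string and return the response
# headers as a hash reference (field name => joined values).
# LWP::UserAgent is loaded lazily so diff_headers() can be used alone.
sub fetch_headers {
    my ($url, $agent) = @_;
    require LWP::UserAgent;

    # An empty hash for cookie_jar enables an in-memory cookie jar.
    my $ua  = LWP::UserAgent->new(agent => $agent, cookie_jar => {});
    my $res = $ua->get($url);    # follows redirects; $res is the final response

    my %headers = ('(status)' => $res->status_line);
    for my $name ($res->headers->header_field_names) {
        next if $name =~ /^Client-/;    # added by LWP itself, not by the server
        $headers{$name} = join ', ', $res->header($name);
    }
    return \%headers;
}

# Compare two header hashes and return one line per field whose value
# differs or that is present on one side only.
sub diff_headers {
    my ($first, $second) = @_;
    my %names = map { $_ => 1 } keys %$first, keys %$second;

    my @report;
    for my $name (sort keys %names) {
        my $left  = exists $first->{$name}  ? $first->{$name}  : '(absent)';
        my $right = exists $second->{$name} ? $second->{$name} : '(absent)';
        push @report, "$name: $left | $right" if $left ne $right;
    }
    return @report;
}

# Usage: perl compare_headers.pl http://my.host/default.aspx
if (@ARGV) {
    my $url = shift @ARGV;

    # Placeholder agent strings (assumptions); substitute your own.
    my $spider  = fetch_headers($url, 'my-spider-agent/1.0');
    my $browser = fetch_headers($url, 'Mozilla/5.0 (placeholder)');

    print "$_\n" for diff_headers($spider, $browser);
}
```

Each output line shows the spider-side value on the left and the
browser-side value on the right. Fields such as Date will often differ
for harmless reasons; a difference in the status line, Location or
Set-Cookie is more likely to be the interesting one.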
bill
On Thursday, May 19, 2005, at 09:07 AM, koszalekopalek wrote:
> koszalekopalek wrote:
>> Bill Moseley wrote:
>>
>>> If I remember correctly, the %visited hash gets set when extracting
>>> links, so it's not easy to do what you are trying.
>>
>>
>> Ok, I whipped this up. The %bogus_visited hash is populated in
>> test_url subroutine. The spider is running now. Do you think
>> it will work?
>
>
> Looks like it worked :-) At least the spider is
> not spidering any more.
>
> Btw, any pointers on why the server is not happy
> with use_cookies => 1, ?
>
> A.
>
>
>
>> sub dbg {
>>     # Append one message to the debug log; skip it if the log can't be opened.
>>     open(my $fh, '>>', '__dbg.log') or return;
>>     print {$fh} $_[0];
>>     close($fh);
>> }
>>
>> # Normalized URLs already seen. Declared outside the sub so that it
>> # persists between calls.
>> my %bogus_visited;
>>
>> sub my_test_url {
>>     my $uri  = shift;
>>     my $path = $uri->path;
>>     my $url  = $uri->canonical->as_string;
>>
>>     # skip images
>>     return 0 if $path =~ m{\.(gif|png|jpeg|jpg)$}i;
>>     # skip archives
>>     return 0 if $path =~ m{\.(zip|gz|tgz|tar)$}i;
>>
>>     # hash for bogus urls
>>     # change http://my.host/(A(AcWS....4PGw2))/default.aspx
>>     # to     http://my.host/(__bogus__)/default.aspx
>>     if ($url =~ s{/\(A\(.*?\)\)}{/(__bogus__)}) {
>>         if ($bogus_visited{$url}) {
>>             dbg("BOGUS (duplicate): $url\n");
>>             return 0;
>>         }
>>         dbg("BOGUS (new): $url\n");
>>         $bogus_visited{$url} = 1;
>>     }
>>     return 1;
>> }
>>
Received on Thu May 19 10:04:30 2005