koszalekopalek wrote:
> Bill Moseley wrote:
>
>>If I remember correctly, the %visited hash gets set when extracting
>>links, so it's not easy to do what you are trying.
>
>
> Ok, I whipped this up. The %bogus_visited hash is populated in the
> test_url subroutine. The spider is running now. Do you think
> it will work?
Looks like it worked :-) At least the spider is
not spidering any more.
Btw, any pointers on why the server is not happy
with use_cookies => 1?
A.
> sub dbg {
>     # append a message to the debug log; give up quietly if it can't be opened
>     open(my $fh, '>>', '__dbg.log') or return;
>     print {$fh} $_[0];
>     close($fh);
> }
>
> # normalized urls seen so far; declared outside the sub so it
> # persists between calls
> my %bogus_visited;
>
> sub my_test_url {
>     my $uri = shift;
>     my $path = $uri->path;
>     my $url = $uri->canonical->as_string;
>
>     # skip images
>     return 0 if $path =~ m{\.(gif|png|jpeg|jpg)$}i;
>     # skip archives
>     return 0 if $path =~ m{\.(zip|gz|tgz|tar)$}i;
>
>     # hash for bogus urls
>     # change http://my.host/(A(AcWS....4PGw2))/default.aspx
>     # to http://my.host/(__bogus__)/default.aspx
>     if ($url =~ s{/\(A\(.*?\)\)}{/(__bogus__)}) {
>         if ($bogus_visited{$url}) {
>             dbg("BOGUS (duplicate): $url\n");
>             return 0;
>         } else {
>             dbg("BOGUS (new): $url\n");
>             $bogus_visited{$url} = 1;
>         }
>     }
>     return 1;
> }
>
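For anyone who wants to try the normalization step outside the spider,
here is a minimal standalone sketch. The normalize_url helper and the
TOKEN1/TOKEN2 placeholders are made up for illustration; they are not part
of spider.pl or of the real site. It keeps the leading slash in the
replacement so the result has the form shown in the comment above.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse the per-session "(A(...))" path segment
# so that the same page reached under different tokens maps to one key.
sub normalize_url {
    my ($url) = @_;
    $url =~ s{/\(A\(.*?\)\)}{/(__bogus__)};
    return $url;
}

my %seen;
for my $url (
    'http://my.host/(A(TOKEN1))/default.aspx',   # placeholder token
    'http://my.host/(A(TOKEN2))/default.aspx',   # placeholder token
    'http://my.host/other.aspx',                 # no session segment
) {
    my $key = normalize_url($url);
    # post-increment: false the first time a key is seen, true afterwards
    my $status = $seen{$key}++ ? 'duplicate' : 'new';
    print "$status: $key\n";
}
```

The first two urls should collapse to the same key, so the second one is
reported as a duplicate, while the third is left untouched and counts as new.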
Received on Thu May 19 09:07:57 2005