Bill Moseley wrote:
> You should be able to set
>
> use_cookies => 1,
>
> in the spider config to enable its cookie jar.
Actually, I tried that, but it did not work: the spider
was still redirected to the "funny" URLs (i.e.
http://my.host/(A(A_random_string_inserted))/some/path).
>>I tried two callback functions for spider.pl (test_url and
>>filter_content) but both did not work:
>
> test_url wouldn't work because that's before the request to the
> server is made -- it would change what is requested from the server.
Ok, this is clear. Thanks.
>>sub my_filter_content {
>> my ( $uri, $response, $server, $content_ref ) = @_;
>> my $path = $uri->path;
>> # remove random string from $path
>> $path =~ s{/\(A\(.*?\)\)}{};
>> $uri->path($path);
>> return 1;
>>}
>>
>>URLs are correctly re-written but the spider never stops spidering.
>>This is what happens (I guess):
[...]
> I think it's something else. The %visited hash gets set before all
> of that.
OK, so what I thought was happening was this:
1) Go to http://my.host/(00000)/doc1.htm
2) Populate %visited with http://my.host/(00000)/doc1.htm
3) Use filter_content to change
   http://my.host/(00000)/doc1.htm
   to
   http://my.host/doc1.htm
4) Index the document and keep on spidering
5) When the spider finds http://my.host/(11111)/doc1.htm,
   it does not know that this document was already spidered
   under a different URL.
So spidering goes on forever...
Did I get that right?
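
To make the steps above concrete, here is a small stand-alone sketch of the
check I would expect to be needed. It does not use spider.pl at all:
normalize_url, should_visit and this %visited are hypothetical names of my
own, and the real spider's bookkeeping may well work differently. The regex
is only a guess that covers the two URL forms mentioned in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: remove a "/(...)" session segment from a URL.
# Handles "/(00000)" as well as one level of nesting, "/(A(random))".
sub normalize_url {
    my ($url) = @_;
    $url =~ s{/\((?:[^()]|\([^()]*\))*\)(?=/)}{};
    return $url;
}

# Stand-in for the spider's record of seen URLs (an assumption,
# not the actual %visited inside spider.pl).
my %visited;

# Return 1 the first time a document is seen, 0 afterwards.
# The key is the normalized URL, so the random string is ignored.
sub should_visit {
    my ($url) = @_;
    my $key = normalize_url($url);
    return 0 if $visited{$key}++;
    return 1;
}

my @found = (
    'http://my.host/(00000)/doc1.htm',
    'http://my.host/(11111)/doc1.htm',
    'http://my.host/(A(A_random_string_inserted))/some/path',
);

for my $url (@found) {
    my $action = should_visit($url) ? 'fetch' : 'skip, already seen';
    print "$url: $action\n";
}
```

If the spider recorded the URL as requested (step 2) rather than the
normalized one, the second entry would be treated as new, which is the
endless loop I described.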
> This worked fine for me. It still spiders the same number of
> documents:
>
> @servers = (
> {
> base_url => 'http://localhost/apache/index.html',
> use_default_config => 1,
> filter_content => sub {
> my ( $uri, $response, $server, $content_ref ) = @_;
> $uri->path('hello');
> return 1;
> },
> }
> );
It works, but does your server keep generating new "bogus" links
containing this random string (e.g. http://my.host/(random_string)/doc1.htm)?
A.
Received on Thu May 19 07:50:01 2005