On Thu, May 19, 2005 at 04:51:58PM +0200, koszalekopalek wrote:
> 1) Go to http://my.host/(00000)/doc1.htm
> 2) Populate %visited with http://my.host/(00000)/doc1.htm
> 3) Use filter_content to change
That just changes the output, right.
> 4) Index the document and keep on spidering
> 5) When the spider finds http://my.host/(11111)/doc1.htm
> it does not know that this URL was already spidered.
> So spidering goes on for ever...
Yes, that's true -- it's a different URL.
If I remember correctly, the %visited hash gets set when extracting
links, so it's not easy to do what you are trying. If your server
continues to give new URLs to follow then the spider will follow
those. (You might try changing $uri->path in a test_response
callback, but I don't think that will work).
So, just maintain your own %seen hash in the config file. Normalize
the URL and add it to %seen in test_url -- and return 0 if
$seen{$url} already exists.
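
Something along these lines, as an untested sketch. The regex assumes
the session token is always a single "/(...)/" path segment, the email
address is a placeholder, and normalize_url is just a helper name made
up for this example. I haven't checked whether test_url is also called
for base_url itself, so watch for that.

```perl
# Sketch of a spider.pl config file (assumptions noted in comments).

my %seen;    # normalized URLs that have already been accepted

# Hypothetical helper: drop the first "/(token)/" path segment so that
#   http://my.host/(00000)/doc1.htm
#   http://my.host/(11111)/doc1.htm
# both normalize to http://my.host/doc1.htm
sub normalize_url {
    my ($url) = @_;
    $url =~ s{/\([^)/]*\)/}{/};    # assumed token format
    return $url;
}

# test_url callback: receives a URI object and the server hash.
# Returning 0 tells the spider to skip the URL.
sub test_url {
    my ( $uri, $server ) = @_;
    my $key = normalize_url( $uri->canonical->as_string );
    return 0 if $seen{$key}++;     # same document under another token
    return 1;
}

@servers = (
    {
        base_url => 'http://my.host/(00000)/doc1.htm',
        email    => 'swish@example.com',    # placeholder address
        test_url => \&test_url,
    },
);

1;
```

The first URL seen for a given document is the one that gets indexed,
session token included; this only stops the spider from following the
same document again under a new token.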
That's why the config file is not plain text -- so you can do things
Received on Thu May 19 08:22:47 2005