Re: URL-fixing with callback routines for spider.pl

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu May 19 2005 - 19:21:21 GMT
On Thu, May 19, 2005 at 11:57:04AM -0700, Bill Moseley wrote:
> One thing I'm seeing is the Referer: header is the new $uri set in the
> filter_content() callback.  Your server isn't looking at the Referer
> header, is it?

Here's how to fix the Referer header, just in case:

Since you are modifying the $uri object for output, you are making
a global change to that object.

In the spider() sub there are these lines:

        my $new_links = process_link( $server, $uri, $parent, $depth );

        push @link_array, map { [ $_, $uri, $depth+1 ] } @$new_links if $new_links;

process_link() fetches $uri, calls test_response() at the start of the
response, then calls filter_content() after fetching the content.
Finally, links are extracted from the page, and test_url() is called
for each link.  That list of extracted links is what is returned from
process_link() (i.e. $new_links).  Those links are then added to
@link_array for later processing.  Each entry in @link_array is an
array of the link, the link's parent, and the depth of the link.

Notice the $uri in the second line (in that three-element array)?
That's the "parent".  Because you modify $uri during process_link(),
the "parent" changes too, and that changed value is used as the
Referer: header in later requests.

I think the easy solution is changing that first line to clone the
$uri:

        my $new_links = process_link( $server, $uri->clone, $parent, $depth );

Then the Referer: header will be ok.
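
If it helps to see the effect in isolation, here is a small sketch of
the same aliasing problem outside of spider.pl.  It assumes the URI
module from CPAN is installed.  fake_process_link() and the localhost
URLs are made up for illustration; the sub only stands in for a
process_link() whose callback rewrites the URI object it was given.

```perl
use strict;
use warnings;
use URI;

# Hypothetical stand-in for process_link(): it rewrites the URI object
# it is handed (as a filter_content() callback might) and returns a
# list of extracted links.
sub fake_process_link {
    my ($uri) = @_;
    $uri->path('/rewritten');              # changes the object in place
    return [ 'http://localhost/child' ];
}

# Without clone: the callee changes the caller's object, so the parent
# stored in the link array is the rewritten URI.
my $uri   = URI->new('http://localhost/original');
my $links = fake_process_link($uri);
my @link_array = map { [ $_, $uri, 1 ] } @$links;
print "parent without clone: $link_array[0][1]\n";

# With clone: the callee only changes its own copy, so the stored
# parent keeps the original URI.
my $uri2   = URI->new('http://localhost/original');
my $links2 = fake_process_link( $uri2->clone );
my @link_array2 = map { [ $_, $uri2, 1 ] } @$links2;
print "parent with clone:    $link_array2[0][1]\n";
```

In the first case the stored parent is the rewritten URI; in the
second it is still the original one, which is what you want sent as
the Referer: header.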

-- 
Bill Moseley
moseley@hank.org

Received on Thu May 19 12:21:24 2005