Re: URL-fixing with callback routines for

From: Bill Moseley <moseley(at)>
Date: Thu May 19 2005 - 14:14:30 GMT
On Thu, May 19, 2005 at 04:23:36AM -0700, koszalekopalek wrote:
> Hello,
> The website that I'm trying to index uses two URL schemes:
> 1) if the browser/agent accepts cookies, "regular" urls are used,
>     for example:

You should be able to set

    use_cookies => 1,

in the spider config to enable its cookie jar.
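
As a rough sketch of where that setting goes, assuming a config file with
a single server entry (the base_url below is only a placeholder, not your
site):

```perl
# Spider config fragment. Only use_cookies is the setting discussed here;
# base_url is a placeholder to replace with the real start page.
@servers = (
    {
        base_url           => 'http://localhost/index.html',
        use_default_config => 1,
        use_cookies        => 1,    # enable the spider's cookie jar
    },
);

1;
```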

> 2) if cookies are rejected a random string is inserted into the
>     URL, e.g.
>     The random strings change during the session.
> The second URL scheme (i.e. the one with the random_string) is used
> for I want to re-write the URLs so that swish-e returns
> "regular" URLs.
> I tried two callback functions for (test_url and 
> filter_content) but both did not work:

test_url wouldn't work because that's before the request to the
server is made -- it would change what is requested from the server.

> sub my_filter_content {
> 	my $path = $uri->path;
>          # remove random string from $path
> 	$path =~ s{/\(A\(.*?\)\)}{};
> 	$uri->path ($path);
> 	return 1;
> }
> URLs are correctly re-written but the spider never stops spidering.
> This is what happens (I guess):
> a) The spider reads, say:
> b) Feeds it to swish-e for indexing as:
> c) The spider enters the "funny" URL
> into the %visited hash. So the next
>     time when it comes across the same URL (but with a modified random
>     string, e.g. that url is not
>     considered visited. The spider reads the page again and sends it
>     to swish-e for (re-)indexing as

I think it's something else.  The %visited hash gets set before all
of that.

This worked fine for me.  It still spiders the same number of pages:

@servers = (
    {
        base_url => 'http://localhost/apache/index.html',
        use_default_config => 1,
        filter_content => sub {
            my ( $uri, $response, $server, $content_ref ) = @_;
            return 1;
        },
    },
);

Try that on your own small "web site" -- create three or four linked
pages and watch what happens.
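
For that kind of test, here is a minimal sketch of a callback that strips
the random string, assuming it always shows up as a "/(A(...))" path
segment the way your regular expression expects.  strip_session_token is
a made-up helper name, not part of the spider, and the token in the last
line is invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the spider): remove a "/(A(...))"
# segment from a path, using the pattern from the quoted message.
sub strip_session_token {
    my ($path) = @_;
    $path =~ s{/\(A\(.*?\)\)}{};
    return $path;
}

# filter_content callback. Only the first argument (the URI object) is
# used, so the names given to the remaining arguments do not matter here.
my $filter_content = sub {
    my ( $uri, @rest ) = @_;
    $uri->path( strip_session_token( $uri->path ) );
    return 1;    # a true value lets the document go on to be indexed
};

# Quick check of the helper on an invented path.
print strip_session_token('/(A(abc123))/dir/page.html'), "\n";   # prints /dir/page.html
```

The sub reference would go in the server entry as
filter_content => $filter_content.  Keeping the substitution in its own
sub makes it easy to check from the command line before running the
spider.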

Bill Moseley

Received on Thu May 19 07:14:38 2005