Re: Behavior of max_depth in spider.pl

From: andy rosbrook <andy_rosbrook(at)not-real.hotmail.com>
Date: Fri Jan 12 2007 - 15:53:52 GMT
Ahh, I understand it now. I'll have a play about with that callback! Thanks,
Bill and Cas :)

andy


>From: Bill Moseley <moseley@hank.org>
>Reply-To: moseley@hank.org
>To: Multiple recipients of list <swish-e@sunsite3.berkeley.edu>
>Subject: [SWISH-E] Re: Behavior of max_depth in spider.pl
>Date: Fri, 12 Jan 2007 06:56:03 -0800 (PST)
>
>On Fri, Jan 12, 2007 at 06:28:24AM -0800, andy rosbrook wrote:
> > Hello all,
> >
> > I am curious about how the max_depth setting works in spider.pl with sub
> > domains. For example, if I index the URL www.somesite.com/sub/ and set
> > max_depth to 2, will the spider stay within the sub folder for links, or
> > will it look inside somesite.com?
>
>max_depth isn't what you probably think it is.
>
>IIRC, depth is a measure of how far a link is from the top-level
>page where you started the spider.  That is, how many "clicks" it took
>to get to the current page from the top page.
>
>Obviously, you can often get to a given page by different click paths.
>So the same page can have different depths depending on how the spider
>finds the page.
>
>It's not a measurement of, say, how many path segments a file is from
>the root.  That's trivial to measure and to test in a "test_url"
>function (just split the path on "/" and count).
>
>max_depth is just there because it's not something that can be counted
>outside of the spider (i.e. in your config).
>
>I think the docs on max_depth discuss this -- yes, slightly, even with
>the misspellings.
>
>
> > I've done a few tests and it seems to go back up into root folders at
> > certain times, I assume when it needs more links? Can anyone explain how
> > it traverses the pages and whether it is possible to limit the spider to
> > only take links from the sub domain?
>
>The only built-in limit the spider has is to stay within the domain.
>If you start at www.somesite.com/sub/ the spider will follow links to
>the root if they exist.  If you want it to always stay within /sub/
>then test that in "test_url".  There's an example of this in the
>sample spider config "SwishSpiderConfig.pl" included in the
>distribution:
>
>sub test_url {
>     my ( $uri, $server ) = @_;
>     # return 1;  # Ok to index/spider
>     # return 0;  # No, don't index or spider;
>
>     # ignore any common image files
>     return if $uri->path =~ /\.(gif|jpg|jpeg|png)$/;
>
>     # make sure that the path is limited to the docs path
>     return $uri->path =~ m[^/current/docs/];
>}
>
>--
>Bill Moseley
>moseley@hank.org
>
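The path-segment test Bill mentions above (split the path on "/" and count) could look roughly like the following. This is only a sketch, not code from the swish-e distribution: path_depth and $MAX_PATH_SEGMENTS are hypothetical names made up for illustration, and the /sub/ prefix is taken from the question. It assumes test_url receives a URI object with a path method, as in the sample config quoted above.

```perl
use strict;
use warnings;

# Hypothetical limit (not a spider.pl setting): the deepest path,
# counted in segments below the site root, that we are willing to fetch.
my $MAX_PATH_SEGMENTS = 3;

# Count the non-empty segments of a URL path.
# "/sub/a/b.html" has three segments: sub, a, b.html
sub path_depth {
    my ($path) = @_;
    return scalar grep { length } split m{/}, $path;
}

# test_url callback: a true return value lets the spider fetch the URL,
# a false one makes it skip the URL.
sub test_url {
    my ( $uri, $server ) = @_;
    my $path = $uri->path;

    # stay inside /sub/ (example prefix, adjust to your own site)
    return 0 unless $path =~ m{^/sub/};

    # skip anything nested deeper than the limit
    return path_depth($path) <= $MAX_PATH_SEGMENTS;
}
```

Unlike max_depth, this check depends only on the URL itself, so a page gets the same answer no matter which click path led the spider to it.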

Received on Fri Jan 12 07:53:54 2007