Re: [swish-e] just extracting link structure, not indexing content

From: Bill Moseley <moseley(at)>
Date: Fri Mar 09 2007 - 14:46:08 GMT
On Fri, Mar 09, 2007 at 12:46:52PM +0000, Darrell Berry wrote:
> Hi -- is there a standard way to just get the *link structure*  
> (rather than content index) of a site using the swish-e tools  
> ( i guess)?
> all i want from the output of my crawl is something like
> www.domain.tld -> www.domain.tld/help
> www.domain.tld -> www.domain.tld/info
> www.domain.tld/info -> www.domain.tld.info2
> www.domain.tld/info -> www.domain.tld
> ie just spidering the whole domain and showing which pages link to  
> which, recursively -- no content, no indexing...? i can find similar  
> questions in the archives, but not a definitive answer -- all help  
> appreciated

Try printing the URL passed to the test_url() callback.  Then dump the
$server parameter that is passed in to see if the parent URL (the page
where the link was found) is listed.  If not, modify check_link() in
spider.pl and stuff $base into the $server hash:

    $server->{parent} = $base;

Then in your test_url() function print out the parent => url.
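
A minimal sketch of what that callback might look like in a spider.pl
config file, assuming check_link() has already been patched as above so
that $server->{parent} holds the referring page.  The base_url and email
values and the name print_link are placeholders, not anything from the
original question.  It prints to STDERR on the assumption that STDOUT is
being piped to swish-e.

```perl
# SwishSpiderConfig.pl (sketch only; the values below are placeholders)
@servers = (
    {
        base_url => 'http://www.domain.tld/',
        email    => 'you@domain.tld',
        test_url => \&print_link,
    },
);

# spider.pl calls test_url with the link's URI object and the server hash.
sub print_link {
    my ( $uri, $server ) = @_;

    # To inspect what is in the hash first, uncomment:
    # use Data::Dumper; print STDERR Dumper($server);

    # {parent} exists only if check_link() was modified to store $base.
    my $parent = defined $server->{parent} ? $server->{parent} : '(unknown)';
    print STDERR "$parent -> $uri\n";

    return 1;    # true value: let the spider go on to fetch this URL
}

1;
```

Returning a true value keeps the crawl going, so every page still gets
fetched and its links extracted.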

Bill Moseley

Received on Fri Mar 9 09:42:13 2007