
Re: Focused Spidering - Multiple Hosts

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Tue Feb 28 2006 - 17:27:19 GMT
What you want to do isn't part of swish-e, per se.

If I were tackling this kind of project (and it sounds interesting), I
would likely use a combination of Swish-e and a database. You probably
want to spider your initial set of URLs, add them to a database in some
way, then spider an additional set of URLs based on your first pass. You
wouldn't want to just follow every link on every site, since you might
end up trying to spider Google or something equally troublesome. But you
could use spider.pl and a database to extract offsite URLs into a list,
then point spider.pl at those URLs, and repeat. Managing the list of
URLs you spider is a tricky thing, best done with some combination of
human review and scripting automation.
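
For the "extract offsite URLs" part, here's a rough, untested sketch of
what a server entry in SwishSpiderConfig.pl might look like. It uses the
filter_content callback to scan each fetched page for links that point
at other hosts and appends them to a flat file. The file name, seed URL
and email address below are just placeholders, and a database table
would do the job just as well:

  # SwishSpiderConfig.pl (sketch only -- adapt to your own setup)
  use strict;
  use warnings;
  use HTML::LinkExtor;
  use URI;

  my $offsite_log = 'offsite_urls.txt';   # placeholder; a DB works too

  our @servers = (
      {
          base_url  => [ 'http://www.example.org/' ],  # your seed site(s)
          email     => 'you@example.org',
          delay_sec => 5,

          # Called with the page content before it goes to the indexer;
          # we use it to harvest links that point off-site.
          filter_content => sub {
              my ( $uri, $server, $response, $content_ref ) = @_;

              return 1 unless $response->content_type =~ m!text/html!;

              my $extor = HTML::LinkExtor->new( undef, $uri );
              $extor->parse( $$content_ref );

              if ( open my $fh, '>>', $offsite_log ) {
                  for my $link ( $extor->links ) {
                      my ( $tag, %attr ) = @$link;
                      next unless $tag eq 'a' and $attr{href};
                      my $target = URI->new( $attr{href} );
                      next unless $target->scheme
                          and $target->scheme =~ /^https?$/;
                      print {$fh} $target->canonical, "\n"
                          if $target->host ne $uri->host;   # off-site link
                  }
                  close $fh;
              }

              return 1;   # index the page as usual
          },
      },
  );

  1;

Returning 1 keeps the page flowing through to the indexer; the side
effect is a growing list of candidate URLs on other hosts.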

My 2 cents.
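
P.S. The "repeat" step doesn't have to be fancy. Something like this
(also untested, and the file name is made up) would boil the harvested
URLs down to one seed per host; review the output by hand and feed the
survivors back in as base_url for the next pass:

  #!/usr/bin/perl
  # next_pass.pl -- collapse harvested off-site URLs to one seed per host
  use strict;
  use warnings;
  use URI;

  my %seed_for;
  open my $in, '<', 'offsite_urls.txt' or die "offsite_urls.txt: $!";
  while ( my $line = <$in> ) {
      chomp $line;
      my $uri = URI->new($line);
      next unless $uri->scheme and $uri->scheme =~ /^https?$/;
      $seed_for{ $uri->host } ||= $uri->canonical;   # first URL seen wins
  }
  close $in;

  # This is where the human-review part of the loop belongs.
  print "$seed_for{$_}\n" for sort keys %seed_for;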

Shay Lawless scribbled on 2/28/06 5:48 AM:
> Hi All,
> 
> I am a newcomer to the list. I have searched the archive in an attempt to
> answer my query before posting, without success, so apologies if this is
> something that has already been discussed and resolved.
> 
> I am a PhD student at Trinity College Dublin. My research involves the use
> of open corpus content in e-learning, i.e. using freely available learning
> content, from both the www and digital libraries, to generate online
> learning offerings / courses personalised to individuals' needs. As part of
> this I need to implement a focused web crawler / spider to create an index
> of sourced learning content on the www, which can then be searched. This is
> where swish-e comes in!
> 
> Having trawled through the multiple indexer / crawler / spider technologies
> out there, the fact that swish-e indexes web pages and supports searching
> by meta tags, etc., makes it a pretty good match for what I require.
> However, having read the swish-e documentation, I see that spider.pl is
> not designed to spider across offsite links or multiple hosts. I realise
> that by adding to the @servers array it is possible to spider multiple
> websites; however, in my case the sites to be crawled will only be
> discovered as the crawl progresses.
> 
> Has anyone out there configured swish-e to perform a focused web crawl
> without providing all the host machine names upfront? Is it even possible
> to do this within swish-e's functionality?
> 
> Any help you can provide will be greatly appreciated. Thanks in advance,
> 
> Shay
> 

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Tue Feb 28 09:27:23 2006