
Re: Focused Spidering - Multiple Hosts

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Feb 28 2006 - 15:34:19 GMT
On Tue, Feb 28, 2006 at 03:48:19AM -0800, Shay Lawless wrote:
> Having trawled through the multiple indexer / crawler / spider technologies
> out there, the fact that swish-e indexes web pages as well as supporting
> searching by meta tags, etc., makes it a pretty good match for what I require.
> However, having read the swish-e documentation, I see that spider.pl is
> not designed to spider across offsite links or multiple hosts. I realise
> that by adding to the @servers array it is possible to spider multiple
> websites, but in my case the sites that need to be crawled will only be
> discovered as the crawl progresses.

Are you talking about an intranet or the Internet?

I suspect that if you plan on finishing your PhD in the next decade or
so, you might need to find a faster way to spider.  It would be trivial
to make the spider ignore the host name, but a single process on a
single machine would be far too slow to ever finish.  You would likely
need hundreds, if not thousands, of widely distributed machines to
spider the entire Internet.
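
For reference, spider.pl is driven by an @servers array in a Perl config
file (SwishSpiderConfig.pl in the distribution).  Below is a minimal
sketch of one entry, under the assumption that spider.pl has been locally
patched so it no longer rejects links on other hosts; base_url,
same_hosts, email, delay_sec and the test_url callback are stock config
keys, while the example.com addresses and the ".ie only" rule are just
placeholders for whatever focus rule you would actually use.

    # Sketch of a SwishSpiderConfig.pl entry, following the layout of the
    # example file shipped with swish-e.  Following hosts discovered during
    # the crawl assumes a locally patched spider.pl that skips its
    # same-host check.
    @servers = (
        {
            skip       => 0,
            base_url   => 'http://www.example.com/',  # placeholder seed URL
            same_hosts => [ 'example.com' ],
            email      => 'spider@example.com',       # placeholder contact address
            delay_sec  => 5,                          # be polite to remote servers

            # Called for every link found; return true to queue it for
            # fetching.  Constrain this somehow, or the crawl never ends.
            test_url => sub {
                my $uri = shift;                      # a URI object
                return 0 unless $uri->scheme eq 'http';
                return 0 if $uri->path =~ /\.(?:jpe?g|gif|png|zip|gz)$/i;
                return $uri->host =~ /\.ie$/i;        # placeholder "focus" rule
            },
        },
    );

    1;

Without some restriction in test_url the crawl has no natural boundary,
which is really the point above about needing far more than one machine.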

Could you use an existing index, such as Google, to find the
documents you want indexed?

-- 
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs
   swish-e@sunsite.berkeley.edu