
Re: advantages and disadvantages of indexing via the spider

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Feb 17 2004 - 13:56:16 GMT
On Mon, Feb 16, 2004 at 08:14:54AM -0800, Eric Lease Morgan wrote:
> I suppose I could use spider.pl to crawl the remote files and index 
> them. I could also use something like wget to create mirrors of the 
> files and index them that way.

Hey, that's the unix way -- specific tools for doing specific tasks.
I think creating a mirror with wget is a fine idea.  IIRC, wget mangles
the saved paths when any of the URLs contain query parameters, so
dynamic URLs would be a problem.  But if it's just static content then
it should work fine.
Wget will only update modified files if you use time-stamping --
assuming the source provides the dates.
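
For example, a mirror run could look something like this (the host name
and local directory are placeholders, not anything from your setup):

    wget --mirror --no-parent --wait=1 \
         -P /var/local/mirror http://www.example.org/docs/

--mirror turns on recursion plus time-stamping, --no-parent keeps the
crawl under the starting directory, and --wait is just a politeness
pause so you don't hammer the remote server.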

> What are the advantages and disadvantages of either approach? If I use 
> the spider, then I don't need nearly as much local disk space. If I do 
> the mirroring thing, then I have local copies and I save on network 
> bandwidth.

Ah, what's a little disk space?  What you save is indexing time.  Run
the mirror update in the background or from a separate cron job to keep
the local copy current, and then index it from another cron job.
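
Something like this in crontab would do it (times, paths, and the
swish-e config name are made up for illustration; the config would
point IndexDir at the mirror directory):

    # refresh the local mirror at 2am
    0 2 * * *  wget --quiet --mirror --no-parent -P /var/local/mirror http://www.example.org/docs/
    # re-index the mirrored files an hour later
    0 3 * * *  swish-e -c /etc/swish-e/docs.conf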

spider.pl does have the advantage of being Perl.  It can easily be
modified to do things wget can't do.  It's been on my todo list for a
while to add a caching feature to spider.pl -- maybe use BerkeleyDB to
store compressed pages and meta data like URL and modified date, 
and modify it to do HEAD requests to check the last modified date to see
if the page is up to date.  Patches welcome...
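
Just to sketch the idea (nothing like this exists in spider.pl yet, and
the module choices -- DB_File standing in for BerkeleyDB, Compress::Zlib
for the compression -- are only a first guess):

    use strict;
    use warnings;
    use DB_File;
    use Compress::Zlib qw( compress uncompress );
    use LWP::UserAgent;

    # one tied hash as the page cache:  url => "last-modified\0compressed body"
    tie my %cache, 'DB_File', 'spider_cache.db'
        or die "Cannot open cache: $!";

    my $ua = LWP::UserAgent->new( agent => 'spider.pl (caching sketch)' );

    sub fetch_page {
        my ($url) = @_;

        if ( exists $cache{$url} ) {
            my ( $cached_mod, $gz ) = split /\0/, $cache{$url}, 2;

            # cheap HEAD request first -- only refetch if Last-Modified changed
            my $head = $ua->head($url);
            if (   $head->is_success
                && ( $head->header('Last-Modified') || '' ) eq $cached_mod )
            {
                return uncompress($gz);    # cached copy is still current
            }
        }

        # cache miss, or the page changed: do the full GET and store the result
        my $res = $ua->get($url);
        return unless $res->is_success;

        my $mod = $res->header('Last-Modified') || '';
        $cache{$url} = join "\0", $mod, compress( $res->content );
        return $res->content;
    }

The real thing would also need to handle servers that never send
Last-Modified (just fall back to a plain GET), but that's the general
shape of it.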

-- 
Bill Moseley
moseley@hank.org
Received on Tue Feb 17 05:56:17 2004