Indexing remote web sites with SWISH++

From: Paul J. Lucas <pjl(at)not-real.ptolemy.arc.nasa.gov>
Date: Tue Dec 29 1998 - 00:04:27 GMT
	I recently discovered the GNU wget utility.  It seems very
	robust, and it looks like it can crawl through a remote web
	site in just about any way you could think of.

	Given its existence and my general loathing of reinventing the
	wheel, it seems fairly easy to make SWISH++ index remote web
	sites using it by providing a simple "glue" script wget2index:

		#! /usr/local/bin/perl
		# Print the local file name from each line of wget's -nv
		# log, where every saved file appears as: -> "local/path"
		while ( <> ) {
			print "$1\n" if /-> "([^"]+)"/;
		}

	Given that, you can now do:

		wget -rxnv -linf -A txt,html -X/cgi-bin \
		http://www.other-site.com 2>&1 | wget2index | index -

	to copy a remote site to the local filesystem so that 'index'
	can index it.  Your Perl CGI script that calls 'search' would
	have to know to take the first directory name in each result
	and make that the hostname.
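
	Something along these lines might do for that mapping (just a
	sketch; it assumes the result paths look the way wget -x lays
	them out, e.g. www.other-site.com/docs/page.html):

		#! /usr/local/bin/perl
		# Hypothetical helper: treat the first directory of an
		# indexed path as the hostname and rebuild the URL.
		sub result_to_url {
			my $path = shift;	# e.g. "www.other-site.com/docs/page.html"
			my ( $host, $rest ) = split m!/!, $path, 2;
			return "http://$host/$rest";
		}

		print result_to_url( "www.other-site.com/docs/page.html" ), "\n";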

	If local filesystem space is an issue, i.e., you don't want to
	copy an entire remote web site to your local filesystem as you
	index it, I'm sure it would be possible to write a slightly
	more complicated Perl script that deletes the files after they
	are indexed, as the get/index cycle progresses.  You'd
	probably end up doing something with the IPC::Open2 Perl module
	(see the Perl 5 "Camel" book, p. 344): open a bidirectional
	pipe to index with the -v3 option so the script can tell when
	a file has been indexed and can then delete it safely.
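
	A rough, untested sketch of that (assuming index accepts -v3
	together with '-' for reading file names from standard input,
	and that -v3 reports each file name on its standard output
	once that file has been indexed; check what -v3 really prints
	before relying on this):

		#! /usr/local/bin/perl
		# Untested sketch: assumes "index -v3 -" reports each file
		# name on its standard output once that file is indexed.
		use IPC::Open2;

		my $pid = open2( \*FROM_INDEX, \*TO_INDEX, 'index -v3 -' );
		select( ( select(TO_INDEX), $| = 1 )[0] );	# autoflush the pipe

		while ( <> ) {				# wget's -nv log on stdin
			next unless /-> "([^"]+)"/;
			my $file = $1;
			print TO_INDEX "$file\n";	# hand the file name to index
			my $done = <FROM_INDEX>;	# wait for index to report it
			unlink $file if defined $done;	# then it's safe to delete
		}

		close TO_INDEX;
		waitpid $pid, 0;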

	- Paul
Received on Mon Dec 28 16:04:44 1998