(no subject)

From: Shaffer, Chris <Chris.Shaffer(at)not-real.BELLSOUTH.COM>
Date: Sun Aug 29 2004 - 20:16:53 GMT

I had a similar situation.  Because some of our sites are dynamic in
nature, we chose to go with spidering.  However, I found the
documentation on setting up spidering a little confusing (there was
a lot of it; it was just ordered a little oddly).  I think what the
documentation could use is a Spidering Getting Started Guide.  The way
the documentation is right now, it's kind of like piecing things together.

Here's what I did to spider all the sites I needed:

First, create a swish.conf file:
	# Example for spidering
	# Use the "" program included with Swish-e

	# Allow extra searching by title, path
	Metanames swishtitle swishdocpath

	# Only index .html, .htm, and .txt files
	IndexOnly .html .htm .txt

	# Set StoreDescription for each parser
	#  to display context with search results
	StoreDescription TXT* 10000
	StoreDescription HTML* <body> 10000

	# Define what site to index
	SwishProgParameters ./spider.conf

Secondly, create a spider.conf file.  See attached file (spider.conf)
for a sample that contains some sane defaults.
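
The attachment is not part of this archived copy, so here is a minimal
sketch of what a spider.conf could look like. It is not the original
file. It assumes the stock Perl spider shipped with Swish-e, which reads
its config as Perl code and expects a global @servers array with one
hash per site. The hostnames, email address, agent string, and limits
are all placeholders, and option names such as delay_sec should be
checked against the spider documentation for your Swish-e version.

```perl
# spider.conf (sketch): evaluated as Perl code by the spider program.
# Every hostname, address, and limit below is a placeholder.

my %main_site = (
    base_url   => 'http://www.example.edu/',   # where spidering starts
    email      => 'webmaster@example.edu',     # sent with each request
    agent      => 'swish-e spider',            # User-Agent string
    delay_sec  => 1,       # pause between requests, in seconds
    max_files  => 5000,    # stop after fetching this many files
    keep_alive => 1,       # reuse connections where the server allows it

    # Callback: return false to skip a URL before it is fetched.
    # Here, image files are never requested.
    test_url   => sub { $_[0]->path !~ /\.(?:gif|jpe?g|png)$/i },
);

# A department site on its own host gets its own entry.
# It reuses the settings above and only changes the starting URL.
my %dept_site = (
    %main_site,
    base_url => 'http://dept.example.edu/',
);

# The spider reads this global array; each element is one site to crawl.
@servers = ( \%main_site, \%dept_site );

1;
```

Listing each site as a separate entry keeps one index covering the main
page and the department pages, which is the situation asked about in
the original message below.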

Now, run the command: swish-e -S prog -c swish.conf -v2

That makes swish-e read swish.conf, which in turn hands spider.conf to
the spider program.  The way I've got everything set up assumes you've
got the proper filters installed for doc, xls, and pdf files.

I hope this helps.

Chris Shaffer

-----Original Message-----
On Behalf Of David Nickel
Sent: Friday, August 27, 2004 1:16 PM
To: Multiple recipients of list
Subject: [SWISH-E] Indexing University Site

We are trying to set up swish-e to index our university's web server. I
am having trouble creating a config file that indexes all of our sites. We
have a main page and underneath we have pages for official departments.
Should the IndexDir be set to or /path/to/web/root?

Any help would be much appreciated. Thanks

Due to deletion of content types excluded from this list by policy,
this multipart message was reduced to a single part, and from there
to a plain text message.
Received on Sun Aug 29 13:17:32 2004