Re: [SWISH-E:213] Re: Indexing of MS word documents

From: Paul J. Lucas <pjl(at)not-real.ptolemy.arc.nasa.gov>
Date: Tue Mar 24 1998 - 16:03:34 GMT

On Tue, 24 Mar 1998, Ruud van Meer wrote:

> I have no problem indexing MS Word files with swish-e. I'm using the
> swish.conf and add .doc to IndexOnly.

	And the generated index files are far larger than they would
	be if you performed text extraction first.  Try a Word
	document with embedded PICT or EPS files and watch how the
	size of your index balloons.  PowerPoint files are the worst.
	We have users who upload PowerPoint files that are a few
	megabytes each.
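
	To make "text extraction first" concrete, here is a rough
	sketch in Python.  It is not a real Word filter and is not
	part of SWISH-E: it only keeps runs of ASCII letters, much
	like the Unix strings utility.  The function name and the
	min_len threshold are assumptions made for illustration.

```python
import re


def extract_text_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of at least min_len ASCII letters found in data.

    Hypothetical helper: a crude stand-in for a real document filter.
    """
    pattern = re.compile(b"[A-Za-z]{" + str(min_len).encode("ascii") + b",}")
    return [run.decode("ascii") for run in pattern.findall(data)]


# Binary debris mixed with a few real words; the short run "qz" is dropped.
sample = b"\x00\x01Index\xff\xfe\x03qz\x00these\x10\x7fwords\x00\x9a"
runs = extract_text_runs(sample)
print(runs)
```

	A filter that understands the file format would do far
	better; the point is only that the indexer should be fed
	words rather than raw bytes.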

	The gibberish "words" indexed are nearly all unique -- this
	brings SWISH-E to its knees, since its binary tree degenerates
	into a linked list, giving a running time of O(n) for word
	insertion and lookup.  For us, this happened with about 1500
	such documents.  If you've got fewer than that, sure, you can
	use SWISH-E; but not if you have several thousand.
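
	The degenerate case can be illustrated with a small Python
	sketch of an unbalanced binary search tree.  This is not
	SWISH-E's code; the names are hypothetical.  It assumes
	keys arriving in sorted order, which is the simplest way to
	provoke the list-like shape; the exact pattern that hit
	SWISH-E is not shown here.

```python
import random


class Node:
    __slots__ = ("key", "left", "right")

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key into an unbalanced BST (no rotations); return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key == node.key:
            return root  # already present
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(key))
            return root
        node = child


def height(root):
    """Count the nodes on the longest root-to-leaf path (iteratively)."""
    best = 0
    stack = [(root, 1)] if root is not None else []
    while stack:
        node, depth = stack.pop()
        best = max(best, depth)
        for child in (node.left, node.right):
            if child is not None:
                stack.append((child, depth + 1))
    return best


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


words = [f"w{i:05d}" for i in range(2000)]  # 2000 distinct keys, sorted

shuffled = words[:]
random.Random(0).shuffle(shuffled)

sorted_height = height(build(words))       # every node has only a right child
shuffled_height = height(build(shuffled))  # stays roughly logarithmic

print(sorted_height, shuffled_height)
```

	With sorted input the height equals the number of keys, so
	each insertion or lookup walks the whole chain.  A balanced
	tree or a hash table avoids this.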

	- Paul J. Lucas
	  NASA Ames Research Center		Caelum Research Corporation
	  Moffett Field, California		San Jose, California
	  <pjl AT ptolemy DOT arc DOT nasa DOT gov>
Received on Tue Mar 24 08:12:55 1998