On Tue, 24 Mar 1998, Ruud van Meer wrote:
> I have no problem indexing MS Word files with swish-e. I'm using the
> swish.conf and add .doc to IndexOnly.
And the generated index files are far larger than they would
otherwise be if you performed text extraction first. Try a
Word document with embedded PICT or EPS files and watch how
the size of your index balloons. PowerPoint files are the
worst. We have users who upload PowerPoint files that are a
few megabytes each.
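
To illustrate what "text extraction first" could mean, here is a
minimal sketch in Python. It is not a Word or PowerPoint converter
and not part of SWISH-E; it is a crude stand-in that keeps runs of
printable ASCII, the same heuristic as the Unix strings(1) default.
The name extract_text and the 4-character minimum are assumptions
made for illustration. It will miss text stored in 16-bit encodings,
and a real converter for the format would do far better.

```python
import re

# Runs of 4 or more printable ASCII bytes (space through tilde).
# The threshold of 4 is an arbitrary choice for this sketch.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_text(data: bytes) -> str:
    """Return the printable ASCII runs in data, one run per line."""
    runs = (m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data))
    return "\n".join(runs)


if __name__ == "__main__":
    # Hypothetical bytes standing in for a binary document.
    sample = b"\x00\x01Hello world\xff\xfe\x02ab\x03swish-e index\x00"
    print(extract_text(sample))
```

The idea is to feed the indexer the extracted text rather than the
raw file, so that embedded image data never reaches the word list.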
The gibberish "words" indexed are all pretty unique -- this
brings SWISH-E to its knees, since its binary tree degenerates
into a linked list with a running time of O(n) for word
insertion and lookup. For us, this happened with about 1500
such documents. If you've got fewer than that, sure, you can use
SWISH-E; but not if you have several thousand.
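
The degeneration described above can be sketched with a plain,
unbalanced binary search tree in Python. This is not SWISH-E's code;
Node, insert, build and height are hypothetical names, and the
sketch only shows the general effect. An unbalanced tree turns into
a chain when keys arrive in sorted (or nearly sorted) order; whether
a given set of gibberish words triggers that depends on the order in
which the indexer sees them.

```python
import random


class Node:
    __slots__ = ("word", "left", "right")

    def __init__(self, word):
        self.word = word
        self.left = None
        self.right = None


def insert(root, word):
    """Insert word with no rebalancing; return (root, comparisons made)."""
    if root is None:
        return Node(word), 0
    node, comparisons = root, 0
    while True:
        comparisons += 1
        if word == node.word:
            return root, comparisons  # already present
        branch = "left" if word < node.word else "right"
        child = getattr(node, branch)
        if child is None:
            setattr(node, branch, Node(word))
            return root, comparisons
        node = child


def build(words):
    """Insert all words; return (root, total comparisons)."""
    root, total = None, 0
    for word in words:
        root, used = insert(root, word)
        total += used
    return root, total


def height(root):
    """Number of nodes on the longest root-to-leaf path (iterative)."""
    best, stack = 0, ([(root, 1)] if root else [])
    while stack:
        node, depth = stack.pop()
        best = max(best, depth)
        for child in (node.left, node.right):
            if child is not None:
                stack.append((child, depth + 1))
    return best


if __name__ == "__main__":
    words = [f"w{i:05d}" for i in range(1000)]  # unique, already sorted
    shuffled = words[:]
    random.Random(0).shuffle(shuffled)
    for label, seq in (("sorted", words), ("shuffled", shuffled)):
        root, total = build(seq)
        print(label, "height:", height(root), "comparisons:", total)
```

With sorted input every new word goes to the right of the previous
one, so the tree's height equals the number of words and each
insertion walks the whole chain. A self-balancing tree or a hash
table would avoid this.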
- Paul J. Lucas
NASA Ames Research Center Caelum Research Corporation
Moffett Field, California San Jose, California
<pjl AT ptolemy DOT arc DOT nasa DOT gov>
Received on Tue Mar 24 08:12:55 1998