On Mon, 23 Mar 1998, Dean Robson wrote:
> I have downloaded swish-e and autoswish with the intention to look at its
> suitability to index our internal documents. Typically these documents are
> in MS word format and MS Excel.
>
> Can swish handle this?
Not really.
> Alternatively, are there other sun based indexers available?
Yes, SWISH++ (see the link to it via the SWISH-E home page).
This is one of the primary reasons I wrote SWISH++. SWISH++
does not index MS documents directly; rather it includes a
utility to extract the raw text out of such documents, e.g.:
	my.doc   -> my.doc.txt
	your.xls -> your.xls.txt
You then index only the *.txt documents. The Perl-CGI/web
interface knows to recognize a file having a double file
extension and substitutes the correct filename on the fly.
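
The double-extension substitution described above can be sketched
roughly as follows. This is only an illustration in Python, not the
actual Perl-CGI code shipped with SWISH++; the function name
original_filename is hypothetical, and the rule "strip .txt only
when another extension remains" is an assumption about how such a
mapping could work.

```python
import os

def original_filename(indexed_name):
    """Map an indexed 'my.doc.txt' back to 'my.doc' (hypothetical helper).

    A plain 'notes.txt' has no second extension, so it is returned as-is.
    """
    stem, ext = os.path.splitext(indexed_name)
    # Strip ".txt" only when what remains still carries its own extension.
    if ext.lower() == ".txt" and os.path.splitext(stem)[1]:
        return stem
    return indexed_name

for name in ("my.doc.txt", "your.xls.txt", "notes.txt"):
    print(name, "->", original_filename(name))
```

Files that were plain text to begin with pass through unchanged,
so only the extracted *.txt files get mapped back to the original
document.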
The text extraction isn't perfect -- it can't be without an
understanding of English (or other native human languages) and
words in a dictionary. It errs on the conservative side, so it
also extracts gibberish words (sequences of binary data inside
the MS file that just so happen to also be ASCII, e.g.,
"BXZPH"), and such "words" are indexed too. However, since
nobody will ever search on such a word (presumably), all this
does is bloat the size of the index file.
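
To illustrate why gibberish slips through, here is a minimal
sketch in Python of a dictionary-free extractor in the spirit of
the Unix strings(1) tool. It is not the algorithm SWISH++'s
extraction utility actually uses; the name extract_words, the
minimum run length of 4, and the sample bytes are all assumptions
made for the example.

```python
import re

# Assumption: treat any run of 4 or more ASCII letters as a "word".
WORD_RUN = re.compile(rb"[A-Za-z]{4,}")

def extract_words(data):
    """Return every run of ASCII letters found in raw binary data."""
    return [run.decode("ascii") for run in WORD_RUN.findall(data)]

# Made-up bytes standing in for the contents of a binary document.
sample = b"\x00\x01Quarterly\x00report\xff\xfeBXZPH\x03\x00ab\x07"
print(extract_words(sample))  # ['Quarterly', 'report', 'BXZPH']
```

Without a dictionary, nothing distinguishes "BXZPH" from a real
word, so it is extracted along with the genuine text.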
Sometimes a file is mostly gibberish. This results in mountains
of data being thrown at the indexing engine. SWISH-E was
crushed under the immense weight and ran out of memory;
SWISH++ indexes mountains of data just fine. And this result,
i.e., being able to index such documents at all, is preferable
to not being able to index them.
- Paul J. Lucas
NASA Ames Research Center        Caelum Research Corporation
Moffett Field, California        San Jose, California
<pjl AT ptolemy DOT arc DOT nasa DOT gov>
Received on Mon Mar 23 17:34:48 1998