> On Tue, 11 Aug 1998, Rainer Scherg RTC wrote:
> > But at this moment I've only installed a filter for PDF files (on a
> > machine). My hope is that, if this feature is released, people will
> > start writing filter progs for different file types (MS-Word,
> > XLS, PPT, and so on...).
> It would be very nice to have Unix-based filters for Microsoft
> Office formats, but reverse-engineering those formats would be
> extremely difficult since you have to handle multiple versions
> of them, e.g., Word 5, Word 6, Word 97, Word 98, Mac, PC; ditto
> for Excel and PowerPoint.
> It's for these reasons I punted in SWISH++ by writing a generic
> text extraction process. It's not perfect (nor can be without
> detailed file-format information) but it seems to do a good job
> in practice. See the documentation for details.
I'm using a very simple filter prog to index Winword Docs on our servers:
FileFilter .doc simple_txt_extract.sh
----- snip --------
#!/bin/sh
# -- simple_txt_extract.sh <docfile>
# Print the runs of printable characters found in the Word file.
strings "$1"
---- snap ---------
You occasionally get some garbage characters, especially when
images are embedded, but it's sufficient for indexing the doc.
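If the garbage turns out to be a problem, the same idea can be tightened a little. The following is only a minimal sketch, not a tested production filter: the function name extract_text, the minimum run length of 6 and the "at least three letters in a row" rule are all assumptions picked for illustration. It relies only on the standard -n option of strings and on grep.

```shell
#!/bin/sh
# extract_text FILE
# Hypothetical helper (name and thresholds are assumptions):
#   1. strings -n 6 keeps only printable runs of 6+ characters,
#      which drops most of the short binary noise;
#   2. grep keeps only lines containing three letters in a row,
#      which drops runs made up purely of punctuation or digits.
extract_text() {
    strings -n 6 "$1" | grep '[A-Za-z][A-Za-z][A-Za-z]'
}

# Act as a filter only when a file name is given on the command line.
if [ "$#" -gt 0 ]; then
    extract_text "$1"
fi
```

Saved in place of simple_txt_extract.sh, it would be called the same way, with the doc file as its only argument. Raising the run length removes more noise but also drops short words that stand alone between binary data, so the value is a trade-off.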
I've also received a private reply pointing to a tool called "catdoc".
BTW: A few words about swish++
So far I've only read the Readme file (some weeks ago).
But for my personal taste, swish++ currently lacks some features I want
to see (e.g. config files).
But I've a wish list for swish-e, too... ;-)
Received on Tue Aug 11 10:42:49 1998