On Tue, 11 Aug 1998, Rainer Scherg RTC wrote:
> But at this moment I've only installed a filter for PDF files (on a Solaris
> machine). My hope is that, if this feature is released, people
> will start to write filter progs for different file types. (MS-Word,
> XLS, PPT, and so on...)
It would be very nice to have Unix-based filters for Microsoft
Office formats, but reverse-engineering those formats would be
extremely difficult since you have to handle multiple versions
of them, e.g., Word 5, Word 6, Word 97, Word 98, Mac, PC; ditto
for Excel and PowerPoint.
You can obtain, via a non-disclosure agreement, the detailed
file format for Word 98 directly from Microsoft at no charge,
but they don't offer the same for other formats (why, I don't
know). However, the document (I obtained a copy) is really
terse. I do know that decoding would be non-trivial since you
have to take their "fast save" feature into account, where older
text is not deleted from the document and changes are merely
appended to the end; hence you have to know which bits of the
file not to index if you want to be very accurate.
It's for these reasons I punted in SWISH++ by writing a generic
text extraction process. It's not perfect (nor can it be without
detailed file-format information) but it seems to do a good job
in practice. See the documentation for details.
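To give a rough idea of what generic text extraction can look like, here
is a minimal sketch in Python. It is not the SWISH++ code and makes no
claim to match its rules; it only assumes that indexable text shows up as
runs of printable ASCII bytes. The function name extract_text and the
MIN_RUN threshold are hypothetical, chosen for illustration.

```python
import re

# Hypothetical threshold (an assumption, not taken from SWISH++):
# runs of printable bytes shorter than this are treated as binary noise.
MIN_RUN = 4

def extract_text(data, min_run=MIN_RUN):
    """Return every run of printable ASCII in data that is at least
    min_run bytes long, in the order the runs appear in the file."""
    # 0x20 through 0x7e covers space and all printable ASCII characters.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_run)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

if __name__ == "__main__":
    # Binary junk surrounding two usable runs and one run that is too short.
    sample = b"\x00\x01Hello world\x00\xffab\x00indexing\x02"
    print(extract_text(sample))
```

A scan like this knows nothing about the file format, so it will also
pick up stale text left behind by "fast save"; that is the accuracy
trade-off described above.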
- Paul J. Lucas
NASA Ames Research Center        Caelum Research Corporation
Moffett Field, California        San Jose, California
<pjl AT ptolemy DOT arc DOT nasa DOT gov>
Received on Tue Aug 11 06:54:16 1998