Re: [SWISH-E:422] Re: Indexing PDF

From: Paul J. Lucas <pjl(at)not-real.ptolemy.arc.nasa.gov>
Date: Tue Aug 11 1998 - 13:44:38 GMT
On Tue, 11 Aug 1998, Rainer Scherg RTC wrote:

> But at this moment I've only installed a filter for PDF files (on a Solaris
> machine). My hope is that, if this feature is released, people will
> start to write filter progs for different file types (MS-Word,
> XLS, PPT, and so on...).

	It would be very nice to have Unix-based filters for Microsoft
	Office formats, but reverse-engineering those formats would be
	extremely difficult since you have to handle multiple versions
	of them, e.g., Word 5, Word 6, Word 97, Word 98, Mac, PC; ditto
	for Excel and PowerPoint.

	You can obtain, via a non-disclosure agreement, the detailed
	file format for Word 98 directly from Microsoft at no charge,
	but they don't offer the same for other formats (why I don't
	know).  However, the document (I obtained a copy) is really
	terse.  I do know that decoding would be non-trivial since you
	have to take their "fast save" feature into account, where older
	text is not deleted from the document and changes are merely
	appended to the end; hence you have to know which bits of the
	file not to index if you want to be very accurate.

	It's for these reasons I punted in SWISH++ by writing a generic
	text extraction process.  It's not perfect (nor can it be without
	detailed file-format information), but it seems to do a good job
	in practice.  See the documentation for details.
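
	To illustrate the general idea of format-agnostic text
	extraction, here is a minimal sketch in Python.  It is not the
	SWISH++ code; it is a "strings"-style approach that assumes
	indexable text shows up as runs of printable ASCII.  The
	function name and the minimum run length of 4 are assumptions
	made for this example only.

```python
import re

# Assumed threshold: shorter runs are treated as binary noise.
MIN_RUN = 4

def extract_text_runs(data, min_run=MIN_RUN):
    """Return runs of printable ASCII that are at least min_run bytes long.

    This knows nothing about the file format, so it can also pick up
    stale text (for example, content left behind by a "fast save").
    """
    # Bytes 0x20 through 0x7e are the printable ASCII characters.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_run)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

if __name__ == "__main__":
    sample = b"\x00\x01Hello world\x00\xff\x02ab\x00indexing\x7f"
    for run in extract_text_runs(sample):
        print(run)
```

	A sketch like this cannot tell live text from leftover text,
	which is the accuracy trade-off described above.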

	- Paul J. Lucas
	  NASA Ames Research Center		Caelum Research Corporation
	  Moffett Field, California		San Jose, California
	  <pjl AT ptolemy DOT arc DOT nasa DOT gov>
Received on Tue Aug 11 06:54:16 1998