Re: Combining MetaNames directive with PDF's Document Properties

From: Bill Moseley <moseley(at)>
Date: Thu Dec 04 2003 - 14:31:18 GMT
On Thu, Dec 04, 2003 at 06:17:45AM -0800, Matt Torbin wrote:
> Does anyone know if it is possible to have Swish-e read a PDF and use 
> the information stored in the "Custom Properties" dialog?  This can be 
> found in Acrobat by going to:
> Document Properties > Custom

If that's the metadata associated with a PDF, yes.  Swish-e provides that 
to you when indexing.

$ swish-filter-test -c test.pdf | grep meta

Document test.pdf was  filtered.
   Document:     test.pdf  (test.pdf)
   Content-Type: text/html
   Parser type:  HTML*

   >Filter used: SWISH::Filters::Pdf2HTML=HASH(0x837ec04) ( application/pdf -> text/html )
<meta name="author" content=" ">
<meta name="creationdate" content="Fri Mar 21 21:42:23 2003">
<meta name="creator" content="Microsoft Word: AdobePS 8.7.3 (301)">
<meta name="encrypted" content="no">
<meta name="file_size" content="32194 bytes">
<meta name="moddate" content="Fri Mar 21 21:42:23 2003">
<meta name="optimized" content="yes">
<meta name="page_size" content="612 x 792 pts (letter)">
<meta name="pages" content="2">
<meta name="pdf_version" content="1.3">
<meta name="producer" content="Acrobat Distiller 5.0.5 for Macintosh">
<meta name="tagged" content="no">
<meta name="title" content="Microsoft Word - LFE02a.doc">

Now, you may want to modify those.  That could be done either in a 
filter_content callback in the spider config, or by modifying the 
existing Pdf2HTML filter (where it creates those meta tags).  For 
example, you might want to store one of those dates in Swish-e as a 
timestamp (for sorting or limiting).
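
To illustrate the date-to-timestamp idea, here is a rough sketch in Python of the transformation itself.  It is not Swish-e code: a real filter_content callback or a Pdf2HTML change would be written in Perl, and the names to_epoch and rewrite_date_metas are made up for this example.  It also assumes the dates are UTC, since the filter output above carries no time zone.

```python
import calendar
import re
import time

# Format of the date strings in the filter output above,
# e.g. "Fri Mar 21 21:42:23 2003".
DATE_FORMAT = "%a %b %d %H:%M:%S %Y"

# Matches only the two date-valued meta tags shown in the output above.
META_DATE_RE = re.compile(
    r'<meta name="(creationdate|moddate)" content="([^"]*)">'
)


def to_epoch(text):
    """Parse a date string and return Unix seconds (ASSUMPTION: UTC)."""
    return calendar.timegm(time.strptime(text, DATE_FORMAT))


def rewrite_date_metas(html):
    """Replace the content of date meta tags with a Unix timestamp."""
    def replace(match):
        name, value = match.groups()
        try:
            epoch = to_epoch(value)
        except ValueError:
            # Leave dates we cannot parse exactly as they were.
            return match.group(0)
        return '<meta name="%s" content="%d">' % (name, epoch)

    return META_DATE_RE.sub(replace, html)


if __name__ == "__main__":
    sample = '<meta name="moddate" content="Fri Mar 21 21:42:23 2003">'
    print(rewrite_date_metas(sample))
```

With the content rewritten as a plain number, the value can be compared numerically, which is the point of storing it as a timestamp for sorting or limiting.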

Bill Moseley
Received on Thu Dec 4 14:31:23 2003