On Thu, Dec 04, 2003 at 06:17:45AM -0800, Matt Torbin wrote:
> Does anyone know if it is possible to have Swish-e read a PDF and use
> the information stored in the "Custom Properties" dialog? This can be
> found in Acrobat by going to:
>
> Document Properties > Custom

If that's the metadata associated with the PDF, yes. Swish-e provides that
to you when indexing:

$ swish-filter-test -c test.pdf | grep meta
Document test.pdf was filtered.
Document: test.pdf (test.pdf)
Content-Type: text/html
Parser type: HTML*
>Filter used: SWISH::Filters::Pdf2HTML=HASH(0x837ec04) ( application/pdf -> text/html )
<meta name="author" content=" ">
<meta name="creationdate" content="Fri Mar 21 21:42:23 2003">
<meta name="creator" content="Microsoft Word: AdobePS 8.7.3 (301)">
<meta name="encrypted" content="no">
<meta name="file_size" content="32194 bytes">
<meta name="moddate" content="Fri Mar 21 21:42:23 2003">
<meta name="optimized" content="yes">
<meta name="page_size" content="612 x 792 pts (letter)">
<meta name="pages" content="2">
<meta name="pdf_version" content="1.3">
<meta name="producer" content="Acrobat Distiller 5.0.5 for Macintosh">
<meta name="tagged" content="no">
<meta name="title" content="Microsoft Word - LFE02a.doc">

Now, you may want to modify those. That could be done either in a
filter_content callback in the spider config, or by modifying the
existing Pdf2HTML filter (where it creates those meta tags). For
example, you might want to store one of those dates in Swish-e as a
timestamp (for sorting or limiting).
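
As a rough sketch of that date-to-timestamp idea: the snippet below rewrites
the creationdate and moddate meta tags into epoch seconds. It is written in
Python only to show the logic; a real filter_content callback or a change to
Pdf2HTML would be Perl. The names date_to_epoch and rewrite_dates are
hypothetical, and treating the dates as UTC is an assumption, since the
filter output above carries no time zone.

```python
import calendar
import re
import time

# Matches the meta tags in the form shown in the filter output above.
META_RE = re.compile(r'<meta name="([^"]+)" content="([^"]*)">')

# Fields assumed to hold dates like "Fri Mar 21 21:42:23 2003".
DATE_FIELDS = {"creationdate", "moddate"}


def date_to_epoch(text):
    """Parse a date string and return epoch seconds (UTC is assumed)."""
    parsed = time.strptime(text, "%a %b %d %H:%M:%S %Y")
    return calendar.timegm(parsed)


def rewrite_dates(html):
    """Replace the content of date meta tags with epoch seconds."""
    def repl(match):
        name, content = match.group(1), match.group(2)
        if name in DATE_FIELDS:
            try:
                content = str(date_to_epoch(content))
            except ValueError:
                pass  # leave values we cannot parse untouched
        return '<meta name="%s" content="%s">' % (name, content)
    return META_RE.sub(repl, html)


if __name__ == "__main__":
    sample = '<meta name="moddate" content="Fri Mar 21 21:42:23 2003">'
    print(rewrite_dates(sample))
```

A numeric value like this is easier to sort on or limit by than the
free-form date string, provided the index is configured to treat that
property as a number.
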
--
Bill Moseley
moseley@hank.org
Received on Thu Dec 4 14:31:23 2003