Re: Parse Error PDF -> HTML with metatag "keywords"

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Mar 21 2005 - 17:00:37 GMT
On Mon, Mar 21, 2005 at 02:17:37AM -0800, Scheermann Leonard wrote:
> I've been asked whether swish-e parses keywords in a pdf file. I
> tested a pdf file "keywords.pdf" (see attachment) with swish-filter-test and
> got the result in the file "parsed.htm" (see also attachment). In the
> STDOUT, only the meta tag "keywords" was parsed incorrectly (see parsed.htm).
> 
> Is that a bug, or could the multi-line keywords in the pdf file be the cause?

Swish parses the Info dictionary of the pdf file using the program
pdfinfo.  Newlines are allowed in the Info tags -- but I'm not clear
how pdfinfo deals with them, as I don't have any pdf docs with
newlines in the tags that I know of.  Swish assumes each tag will be
printed by pdfinfo as a single line.

If you have a multi-line pdf info tag then run pdfinfo on the file and
post back what it reports.  It will be easy for you to adjust the
filter, if needed.
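
As a rough illustration of the kind of adjustment meant here, the following Python sketch folds continuation lines into the preceding tag when reading pdfinfo-style "Key: value" text. It is not the swish-e filter code, and it rests on an untested assumption: that a multi-line tag shows up as extra lines with no "Key:" prefix. The function name parse_info and the sample text are made up for illustration.

```python
import re

# A line that starts a new tag: a name made of letters and spaces,
# a colon, then the value (for example "Keywords:   alpha, beta").
TAG_LINE = re.compile(r"^([A-Za-z][A-Za-z ]*):\s*(.*)$")

def parse_info(text):
    """Parse pdfinfo-style text into a dict, joining continuation lines."""
    info = {}
    current = None
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            # Start of a new tag.
            current = match.group(1).strip()
            info[current] = match.group(2).strip()
        elif current is not None and line.strip():
            # Assumed continuation of the previous tag's value.
            info[current] = (info[current] + " " + line.strip()).strip()
    return info

# Hypothetical sample; real pdfinfo output may look different.
sample = (
    "Title:          Test\n"
    "Keywords:       alpha, beta,\n"
    "gamma, delta\n"
    "Pages:          3\n"
)
print(parse_info(sample))
```

One known weakness of this approach: a continuation line that itself looks like "word: text" would be mistaken for a new tag, so the actual pdfinfo output should be checked before relying on anything like this.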

> Does it make a difference to search results if a pdf file has document
> properties keywords (or an html file has a meta tag "keywords")?

The info tags are just placed in html <meta> tags.  How swish indexes
or doesn't index them is up to the swish-e config file.
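
To make the idea concrete, here is a small Python sketch of turning a dictionary of Info tags into html <meta> tags. It only illustrates the general shape and is not the filter's actual code; the function name meta_tags and the lowercasing of tag names are assumptions for the example.

```python
from html import escape

def meta_tags(info):
    """Render each Info tag as one html <meta> line, escaping the values."""
    lines = []
    for name, value in info.items():
        lines.append(
            '<meta name="%s" content="%s">'
            % (escape(name.lower(), quote=True), escape(value, quote=True))
        )
    return "\n".join(lines)

# Hypothetical input, as it might come from parsing the Info dictionary.
print(meta_tags({"Title": "Test", "Keywords": "alpha, beta"}))
```

Escaping matters because an Info value containing a double quote or an ampersand would otherwise break the attribute.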

>  <<keywords.pdf>>  <<parsed.htm>> 

Sorry, the mailing list software we have doesn't allow attachments.

-- 
Bill Moseley
moseley@hank.org

Received on Mon Mar 21 09:00:41 2005