Re: Swish-e PDF titles in search results

From: Bill Moseley <moseley(at)>
Date: Fri Jul 14 2006 - 16:44:57 GMT
On Fri, Jul 14, 2006 at 04:32:14PM +0100, Luke Simmons wrote:
> [root (at) tiger archive]# /usr/local/lib/swish-e/
> edjanfeb06.pdf | grep title
> <title>Jan Feb 06</title>
> <meta name="title" content="Jan Feb 06">
> But without a filter it appears to not be parsing the html output  
> from the pdf to the index. So after an index it doesn't show anything  
> up in the search (cgi) including the title.
> Do I need to add pdf2HTML as a file filter in the config? And also  
> make the changes that Peter Karman suggested?  (thanks Peter)
> FileFilter .pdf /usr/local/lib/swish-e/perl/SWISH/Filters/ 
>    # Does this or anything need to go here?

No.  Again, as you can see from the output you posted, it's already
producing *html*, so you don't want to tell swish to convert it to
html again.  It's already been converted.

> DefaultContents HTML*
> StoreDescription HTML* <body> 200000

So, if you run swish-e from the command line, is the body stored in
the swishdescription property?
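
One way to check this from the command line is to search the index and
print the stored properties with -x.  This is only a sketch: the index
name (swish-e's default, index.swish-e) and the query word are
assumptions, so substitute your own.

```shell
# Hypothetical values: adjust the index path and query word to your setup.
index="index.swish-e"
query="jan"

# Output format: path, title, and the stored description, one hit per line.
format='<swishdocpath> | <swishtitle> | <swishdescription>\n'

# Only query when swish-e is installed and the index file exists.
if command -v swish-e >/dev/null 2>&1 && [ -f "$index" ]; then
    swish-e -f "$index" -w "$query" -m 5 -x "$format"
else
    echo "swish-e or $index not found; nothing to query"
fi
```

If the third column comes back empty for the PDF hits, the body text is
not reaching the swishdescription property.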

> Am I right to believe that when indexing the process pulls the PDF  
> apart and each part is HTML tagged up (i.e. title > <title></title>  
> and the text snippet to <body></body>)?

Not the indexing step (swish-e) itself; the conversion happens before
that (by using SWISH::Filter).  Look at the converter's output.

> Is the process then not putting the HTML into the index?

Should be.  You can use -T indexed_words properties to see what's
ending up in the index while indexing.
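
As a rough sketch, an indexing run with those debugging options might
look like the following.  The config file name is an assumption, and a
setup that feeds swish-e from an external program would need its usual
extra options as well.

```shell
# Hypothetical config file name; substitute your own.
conf="swish.conf"

# -T takes one or more debugging options; these are the two named above.
cmd="swish-e -c $conf -T indexed_words properties"
echo "$cmd"

# Only run it when swish-e is installed and the config file exists.
# The output can be long, so show just the first lines.
if command -v swish-e >/dev/null 2>&1 && [ -f "$conf" ]; then
    $cmd | head -n 40
fi
```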

> I added the old FileFilter of pdftotext in and this runs ok just  
> without the title attribute working.

That shouldn't work.  pdftotext isn't very good at converting html to

Bill Moseley

Received on Fri Jul 14 09:44:59 2006