Swish-e PDF titles in search results

From: Luke Simmons <lukes(at)not-real.deeson.co.uk>
Date: Thu Jul 13 2006 - 13:45:20 GMT
Hi Bill,

I am emailing you about a problem I am having with Swish-e. I found
and followed this reply on your discussion list:
http://www.swish-e.org/archive/2005-02/9062.html

The person you helped had a problem with PDF files being indexed
without the title metadata being used as swishtitle. The swishtitle
would show the filename of the PDF instead, and this then showed up in
the search results. HTML results were fine, though.

I too am experiencing this problem, and despite carefully following
the instructions in your reply, I still cannot get the PDFs to index
correctly and present the document title as the swishtitle. My version
of xpdf is up to date.

The section on how DirTree.pl deals with the file (i.e. outputting the
metadata contents of the PDF) works as expected:

[root (at) tiger archive]# /usr/local/lib/swish-e/DirTree.pl edjanfeb06.pdf | head -30

<head>
<meta name="author" content="A person">
<meta name="creationdate" content="Tue Jan  3 11:10:41 2006">
<meta name="creator" content="QuarkXPress: pictwpstops filter 1.0">
<meta name="encrypted" content="no">
<meta name="file_size" content="2711063 bytes">
<meta name="moddate" content="Thu Jul 13 10:42:32 2006">
<meta name="optimized" content="no">
<meta name="page_size" content="595 x 842 pts (A4)">
<meta name="pages" content="36">
<meta name="pdf_version" content="1.5">
<meta name="producer" content="Acrobat Distiller 6.0.1 for Macintosh">
<meta name="tagged" content="no">
<meta name="title" content="Jan Feb 06">
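
As a quick check that the title really is present in output like the above, the meta tags can be pulled out with a short script. This is only a sketch in Python using the standard html.parser module; the names MetaCollector and extract_meta are hypothetical, and the sample is a shortened copy of the output shown above rather than anything swish-e itself produces.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name is not None and content is not None:
            self.meta[name.lower()] = content


def extract_meta(html_text):
    """Return a dict of meta name -> content found in html_text."""
    parser = MetaCollector()
    parser.feed(html_text)
    parser.close()
    return parser.meta


# Shortened copy of the DirTree.pl output above (illustrative only).
sample = """<head>
<meta name="author" content="A person">
<meta name="pages" content="36">
<meta name="title" content="Jan Feb 06">
"""

meta = extract_meta(sample)
print(meta.get("title"))  # prints: Jan Feb 06
```

If the title key is missing or empty here, the problem is upstream of swish-e; if it is present, as in my case, the title is being lost at the indexing stage.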


I can get as far as the section of your instructions where you ask for
the output of:

[root (at) tiger archive]# /usr/local/lib/swish-e/DirTree.pl edjanfeb06.pdf | swish-e -S prog -i stdin -c ../../cgi-bin/archswish/swish.conf -v0 -T properties

Error: May not be a PDF file (continuing anyway)
Error (0): PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
      swishdocpath: 6 ( 16) S: "./edjanfeb06.pdf"
      swishdocsize: 8 (  4) N: "140528"
      swishlastmodified: 9 (  4) D: "2006-07-13 10:51:56 BST"

Warning: Unknown header line: '38' from program stdin
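
For context on that warning: as I understand the -S prog interface, swish-e reads each document from the program's output as a few header lines (Path-Name and Content-Length, optionally Last-Mtime and Document-Type), then a blank line, then exactly Content-Length bytes of content. An "Unknown header line" warning suggests swish-e was expecting a header at that point and read something else. The sketch below shows the general shape of one such record. It is written in Python rather than the Perl used by DirTree.pl, the function name build_prog_record and the sample values are hypothetical, and it is not taken from DirTree.pl.

```python
def build_prog_record(path, content, mtime=None, doc_type=None):
    """Build one document record in the header/blank-line/body layout.

    The length is counted in bytes, not characters, because the
    consumer reads exactly that many bytes after the blank line.
    """
    body = content.encode("utf-8")
    headers = [
        f"Path-Name: {path}",
        f"Content-Length: {len(body)}",
    ]
    if mtime is not None:
        # Seconds since the epoch, as an integer.
        headers.append(f"Last-Mtime: {int(mtime)}")
    if doc_type is not None:
        headers.append(f"Document-Type: {doc_type}")
    head = "\n".join(headers) + "\n\n"
    return head.encode("utf-8") + body


# Hypothetical sample values, for illustration only.
record = build_prog_record(
    "./example.pdf",
    '<head>\n<meta name="title" content="Jan Feb 06">\n</head>\n',
    mtime=1152784316,
    doc_type="HTML*",
)
header_part, _, body_part = record.partition(b"\n\n")
print(header_part.decode("utf-8"))
```

If the declared length and the actual content disagree, the leftover bytes would be read as the next document's headers, which is one way a warning like the one above could arise; I have not confirmed that this is what is happening here.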

Everything up to this point works correctly, and I have put the  
"PropertyNameAlias swishtitle title" into swish.conf (my swish config  
file). Is there a specific place this should sit?

(contents of config file - swish.conf -

	IndexDir / -- server path -- /archive

	IndexOnly .htm .pdf .php
	MaxWordLimit 15

	PropertyNameAlias swishtitle title

	DefaultContents HTML*
	StoreDescription HTML* <body> 200000
	MetaNames swishdocpath swishtitle
	#MetaNames swishdefault

	ReplaceRules remove / -- server path -- /archive/
	FileFilter .pdf /usr/local/bin/pdftotext   "'%p' -"

)

Also, although the error message suggests the file may not be a PDF,
this occurs for every PDF the command is run on. The file is not
damaged; it is a freshly created document (I also ran this on a
brand-new, clean PDF and it returned the same error).

I found the bit about PDF titles in Filters.pm, at the end where the  
comments suggest the inclusion of

         my %user_data;
         $user_data{pdf}{title_tag} = 'title';

         $was_filtered = $filter->filter(
             document  => $filename,
             user_data => \%user_data,
         );

into Pdf2HTML.pm. However, I was unsure a) where to put it and
b) whether it is required if the PropertyNameAlias directive is
working.

I would therefore be grateful for your help: what am I doing wrong?


Thanks

Luke
Received on Thu Jul 13 06:45:37 2006