On Wed, Feb 16, 2005 at 08:27:49AM -0800, Scheermann Leonard wrote:
> PDF-Files are indexed correctly, but in search results <swishtitle> property
> displays PDF-Filenames instead title of PDF-Files.
> The same problem is with word and excel files. Though HTML-Files are
> displayed with title.
My guess would be that the pdf files don't have a <title> tag.
> FileFilter .pdf /home/swishe/swish-e/lib/swish-e/DirTree.pl
No. DirTree.pl is not a filter, it's a program for use with the -S
prog method.
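To make the difference concrete: a -S prog program writes each document to stdout as a few headers, a blank line, and then the content, and swish-e reads that stream. Here is a rough shell sketch of the idea. Note that emit_doc is a made-up name for illustration, not part of swish-e, and the header names are simply the ones DirTree.pl prints.

```shell
#!/bin/sh
# Hypothetical helper, not part of swish-e: print one document in the
# header-plus-body layout that a -S prog program writes to stdout.
emit_doc() {
    path=$1      # name to record as the document path
    body=$2      # file holding the (already converted) content
    # Content-Length must be the exact byte count of the body.
    size=$(wc -c < "$body" | tr -d ' ')
    printf 'Path-Name: %s\n' "$path"
    printf 'Content-Length: %s\n' "$size"
    printf 'Document-Type: HTML*\n'
    printf '\n'              # blank line ends the headers
    cat "$body"
}

# Demo with a throwaway file so this runs without swish-e installed.
tmp=$(mktemp)
printf '<html><head><title>Test</title></head><body>hi</body></html>\n' > "$tmp"
emit_doc docs/test.html "$tmp"
rm -f "$tmp"
```

Real use would presumably pipe the output into swish-e -S prog -i stdin, the same way the DirTree.pl output is piped.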
Mind if I think out loud?
Hummm, so what does it produce??
Do I have a pdf with a title? Oh, here's one:
moseley@bumby:~$ pdfinfo /usr/lib/Acrobat5/Reader/help/acrobat.pdf
Title: Adobe Acrobat Reader UpSell PDF
Subject: There's more to Acrobat than the Reader!
Author: Adobe Systems Incorporated
Producer: Acrobat Distiller 4.05 for Macintosh
CreationDate: Tue Dec 12 11:42:12 2000
ModDate: Tue Dec 12 11:48:03 2000
Tagged: no
Pages: 1
Encrypted: no
Page size: 611.379 x 792.237 pts
File size: 28451 bytes
Optimized: no
PDF version: 1.3
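If you want to check a pile of PDFs for a Title field, something like this might help. It is only a sketch: pdfinfo_title is a hypothetical helper name, and it assumes the "Title:" line format shown in the pdfinfo output above.

```shell
#!/bin/sh
# Hypothetical helper: read pdfinfo-style output on stdin and print the
# value of the Title field, or nothing if there is no Title line.
pdfinfo_title() {
    sed -n 's/^Title:[[:space:]]*//p' | head -n 1
}

# Demo on canned text so this runs without pdfinfo installed.
printf 'Title: Adobe Acrobat Reader UpSell PDF\nPages: 1\n' | pdfinfo_title
```

With pdfinfo installed, piping "pdfinfo somefile.pdf" into pdfinfo_title should show which files come back with an empty title.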
So how does DirTree.pl deal with that file?
moseley@bumby:~$ rm -rf pdf
moseley@bumby:~$ mkdir pdf
moseley@bumby:~$ cp /usr/lib/Acrobat5/Reader/help/acrobat.pdf pdf/test.pdf
moseley@bumby:~$ /usr/local/lib/swish-e/DirTree.pl pdf | head -30
Path-Name: pdf/test.pdf
Content-Length: 3831
Last-Mtime: 1108572978
Document-Type: HTML*
<html>
<head>
<meta name="author" content="Adobe Systems Incorporated">
<meta name="creationdate" content="Tue Dec 12 11:42:12 2000">
<meta name="encrypted" content="no">
<meta name="file_size" content="28451 bytes">
<meta name="moddate" content="Tue Dec 12 11:48:03 2000">
<meta name="optimized" content="no">
<meta name="page_size" content="611.379 x 792.237 pts">
<meta name="pages" content="1">
<meta name="pdf_version" content="1.3">
<meta name="producer" content="Acrobat Distiller 4.05 for Macintosh">
<meta name="subject" content="There's more to Acrobat than the Reader!">
<meta name="tagged" content="no">
<meta name="title" content="Adobe Acrobat Reader UpSell PDF">
</head>
<body>
<pre>
There's more to than the
How often does this happen to you?
Acrobat Reader !
TM
®
Ok, so there's no <title> tag, but there is a meta title.
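That eyeball check could be scripted, roughly like the sketch below. The name title_source is made up for illustration, and it assumes the input is the converted HTML for a single document (headers included is fine), not a whole tree.

```shell
#!/bin/sh
# Hypothetical helper: read converted HTML on stdin and report where a
# title could come from: "title-tag", "meta-title", or "none".
title_source() {
    html=$(cat)
    if printf '%s\n' "$html" | grep -qi '<title>'; then
        echo title-tag
    elif printf '%s\n' "$html" | grep -qi '<meta name="title"'; then
        echo meta-title
    else
        echo none
    fi
}

# Demo on canned HTML that has only a meta title.
printf '<head>\n<meta name="title" content="Some PDF">\n</head>\n' | title_source
```

The <title> test comes first because, when both are present, that tag is the one that matters for swishtitle here.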
So one solution would be to alias title to swishtitle, I suppose:
moseley@bumby:~$ cat c
PropertyNameAlias swishtitle title
moseley@bumby:~$ /usr/local/lib/swish-e/DirTree.pl pdf | swish-e -S prog -i stdin -c c -v0 -T properties
swishdocpath: 6 ( 12) S: "pdf/test.pdf"
swishtitle: 7 ( 31) S: "Adobe Acrobat Reader UpSell PDF"
swishdocsize: 8 ( 4) N: "3831"
swishlastmodified: 9 ( 4) D: "2005-02-16 08:56:18 PST"
Well that wasn't too hard. But what's the deal with no <title> tag in
the first place?
What does the filter's docs have to say:
moseley@bumby:~$ PERL5LIB=`swish-filter-test -path` perldoc SWISH::Filters::Pdf2HTML.pm
You may pass into SWISH::Filter's new method a tag to use as the html
<title> if found in the PDF info tags:
my %user_data;
$user_data{pdf}{title_tag} = 'title';
$was_filtered = $filter->filter(
document => $filename,
user_data => \%user_data,
);
Then if a PDF info tag of "title" is found that will be used as the HTML <title>.
Oh, not sure why "title" isn't the default. Should I patch in the
filter or in DirTree.pl? Well, I have DirTree open in vim, so try it
there:
moseley@bumby:~$ cp /usr/local/lib/swish-e/DirTree.pl .
moseley@bumby:~$ vim DirTree.pl
moseley@bumby:~$ diff -u /usr/local/lib/swish-e/DirTree.pl .
--- /usr/local/lib/swish-e/DirTree.pl 2005-01-25 14:39:41.000000000 -0800
+++ ./DirTree.pl 2005-02-16 09:10:11.000000000 -0800
@@ -124,6 +124,7 @@
if ( $filter ) {
my $doc = $filter->convert(
document => $path,
+ user_data => { pdf=> { title_tag => 'title' } },
);
unless ( $doc ) {
if ( $options{no_skip} ) {
moseley@bumby:~$ ./DirTree.pl pdf | grep title
<title>Adobe Acrobat Reader UpSell PDF</title>
<meta name="title" content="Adobe Acrobat Reader UpSell PDF">
Yep, that works.
But the filter probably should default to "title" regardless (and the
docs that say to use the $filter->filter() call should be fixed).
Cool. Time for coffee.
--
Bill Moseley
moseley@hank.org
Received on Wed Feb 16 09:19:17 2005