Re: PDF-Files title in search results as Filename ???

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Feb 16 2005 - 17:19:11 GMT
On Wed, Feb 16, 2005 at 08:27:49AM -0800, Scheermann Leonard wrote:
> PDF-Files are indexed correctly, but in search results <swishtitle> property
> displays PDF-Filenames instead title of PDF-Files.
> The same problem is with word and excel files. Though HTML-Files are
> displayed with title.

A likely guess would be that the converted PDF files don't have a <title> tag.

> FileFilter .pdf /home/swishe/swish-e/lib/swish-e/DirTree.pl

No. DirTree.pl is not a filter; it's a program for use with the -S
prog method.
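
For contrast with the FileFilter line above, here is a minimal sketch of how a -S prog program is usually wired in: the program is named as the thing to index, and the directories to walk are passed as its parameters. The config file name swish-pdf.conf is made up for this example, and both paths are assumptions that need adjusting to the local install.

```shell
# Hypothetical config file name; paths are taken from this thread and
# will differ on other systems.
conf=swish-pdf.conf

cat > "$conf" <<'EOF'
# With -S prog, IndexDir names the program to run, not a directory.
IndexDir /usr/local/lib/swish-e/DirTree.pl
# Arguments handed to that program: the directories it should walk.
SwishProgParameters pdf
EOF

cat "$conf"

# Indexing would then be run as:
#   swish-e -c swish-pdf.conf -S prog
```

The piped form used later in this message (DirTree.pl pdf | swish-e -S prog -i stdin) does the same job without a config entry for the program.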

Mind if I think out loud?

Hummm, so what does it produce??

Do I have a PDF with a title?  Oh, here's one:

    moseley@bumby:~$ pdfinfo /usr/lib/Acrobat5/Reader/help/acrobat.pdf
    Title:          Adobe Acrobat Reader UpSell PDF
    Subject:        There's more to Acrobat than the Reader!
    Author:         Adobe Systems Incorporated
    Producer:       Acrobat Distiller 4.05 for Macintosh
    CreationDate:   Tue Dec 12 11:42:12 2000
    ModDate:        Tue Dec 12 11:48:03 2000
    Tagged:         no
    Pages:          1
    Encrypted:      no
    Page size:      611.379 x 792.237 pts
    File size:      28451 bytes
    Optimized:      no
    PDF version:    1.3


So how does DirTree.pl deal with that file?

    moseley@bumby:~$ rm -rf pdf

    moseley@bumby:~$ mkdir pdf

    moseley@bumby:~$ cp /usr/lib/Acrobat5/Reader/help/acrobat.pdf pdf/test.pdf

    moseley@bumby:~$ /usr/local/lib/swish-e/DirTree.pl pdf | head -30
    Path-Name: pdf/test.pdf
    Content-Length: 3831
    Last-Mtime: 1108572978
    Document-Type: HTML*

    <html>
    <head>
    <meta name="author" content="Adobe Systems Incorporated">
    <meta name="creationdate" content="Tue Dec 12 11:42:12 2000">
    <meta name="encrypted" content="no">
    <meta name="file_size" content="28451 bytes">
    <meta name="moddate" content="Tue Dec 12 11:48:03 2000">
    <meta name="optimized" content="no">
    <meta name="page_size" content="611.379 x 792.237 pts">
    <meta name="pages" content="1">
    <meta name="pdf_version" content="1.3">
    <meta name="producer" content="Acrobat Distiller 4.05 for Macintosh">
    <meta name="subject" content="There's more to Acrobat than the Reader!">
    <meta name="tagged" content="no">
    <meta name="title" content="Adobe Acrobat Reader UpSell PDF">
    </head>
    <body>
    <pre>
    There's more to than the
    How often does this happen to you?

    Acrobat Reader !
    TM

    

Ok, so there's no <title> tag, but there is a meta title.
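
The mapping from the pdfinfo fields to the meta tags above can be sketched in a few lines of awk. This is only an illustration of why the title lands in a meta tag rather than in <title>; it is not the code the filter runs, the function name info_to_meta is made up, and it does no HTML escaping.

```shell
# Sample pdfinfo-style output (a subset of the fields shown above).
pdfinfo_sample='Title:          Adobe Acrobat Reader UpSell PDF
Author:         Adobe Systems Incorporated
Pages:          1'

# Turn each "Key: value" line into a <meta> tag.  The key is lowercased
# and its spaces become underscores (so "Page size" -> "page_size").
info_to_meta() {
    awk '{
        i = index($0, ":")          # split at the first colon only
        if (i == 0) next            # skip lines that are not "Key: value"
        key = tolower(substr($0, 1, i - 1))
        gsub(/ /, "_", key)
        val = substr($0, i + 1)
        sub(/^[ \t]+/, "", val)     # drop the padding after the colon
        printf "<meta name=\"%s\" content=\"%s\">\n", key, val
    }'
}

meta_output=$(printf '%s\n' "$pdfinfo_sample" | info_to_meta)
printf '%s\n' "$meta_output"
```

Every field is treated the same way, so nothing singles out the title as the document's <title>.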

So one solution would be to alias title to swishtitle, I suppose:

    moseley@bumby:~$ cat c
    PropertyNameAlias swishtitle title


    moseley@bumby:~$ /usr/local/lib/swish-e/DirTree.pl pdf | swish-e -S prog -i stdin -c c -v0 -T properties
              swishdocpath: 6 ( 12) S: "pdf/test.pdf"
                swishtitle: 7 ( 31) S: "Adobe Acrobat Reader UpSell PDF"
              swishdocsize: 8 (  4) N: "3831"
         swishlastmodified: 9 (  4) D: "2005-02-16 08:56:18 PST"

Well that wasn't too hard.  But what's the deal with no <title> tag in
the first place?

What does the filter's docs have to say:

    moseley@bumby:~$ PERL5LIB=`swish-filter-test -path` perldoc SWISH::Filters::Pdf2HTML.pm

    You may pass into SWISH::Filter's new method a tag to use as the html
    <title> if found in the PDF info tags:

        my %user_data;
        $user_data{pdf}{title_tag} = 'title';

        $was_filtered = $filter->filter(
            document  => $filename,
            user_data => \%user_data,
        );

    Then if a PDF info tag of "title" is found that will be used as the HTML <title>.

Oh, not sure why "title" isn't the default.  Should I patch in the
filter or in DirTree.pl?  Well, I have DirTree open in vim, so try it
there:

    moseley@bumby:~$ cp /usr/local/lib/swish-e/DirTree.pl .
    moseley@bumby:~$ vim DirTree.pl

    moseley@bumby:~$ diff -u /usr/local/lib/swish-e/DirTree.pl .
    --- /usr/local/lib/swish-e/DirTree.pl   2005-01-25 14:39:41.000000000 -0800
    +++ ./DirTree.pl        2005-02-16 09:10:11.000000000 -0800
    @@ -124,6 +124,7 @@
         if ( $filter ) {
             my $doc = $filter->convert(
                 document    => $path,
    +            user_data   => { pdf=> { title_tag => 'title' } },
             );
             unless ( $doc ) {
                 if ( $options{no_skip} ) {

    moseley@bumby:~$ ./DirTree.pl pdf | grep title
    <title>Adobe Acrobat Reader UpSell PDF</title>
    <meta name="title" content="Adobe Acrobat Reader UpSell PDF">

Yep, that works.
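
The original question also mentions Word and Excel files, so the same grep check may be handy for those. A rough sketch, assuming the converted output is HTML like the PDF case; the function name title_source is made up, and it looks at the whole stream rather than at each document separately.

```shell
# Classify converted output read from stdin:
#   "title" if a <title> element is present,
#   "meta"  if only a <meta name="title"> tag is present,
#   "none"  otherwise.
title_source() {
    input=$(cat)
    if printf '%s\n' "$input" | grep -qi '<title>'; then
        echo title
    elif printf '%s\n' "$input" | grep -qi '<meta name="title"'; then
        echo meta
    else
        echo none
    fi
}

# With the real program this would be:  ./DirTree.pl pdf | title_source
# Here two canned samples stand in for the unpatched and patched output.
unpatched=$(printf '%s\n' '<head>' \
    '<meta name="title" content="Adobe Acrobat Reader UpSell PDF">' \
    '</head>' | title_source)
patched=$(printf '%s\n' '<head>' \
    '<title>Adobe Acrobat Reader UpSell PDF</title>' \
    '</head>' | title_source)

echo "unpatched: $unpatched, patched: $patched"
```

A result of "meta" means the PropertyNameAlias approach is still needed for that file type; "none" means the document carries no title information to use.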

But the filter probably should default to "title" regardless (and the
docs that say to use the $filter->filter() call should be fixed).

Cool.  Time for coffee.




-- 
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs
   swish-e@sunsite.berkeley.edu
Received on Wed Feb 16 09:19:17 2005