Re: Using a translated link for the 'found' hyperlink

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Nov 28 2005 - 19:34:01 GMT
On Mon, Nov 28, 2005 at 08:39:38AM -0800, J. David Boyd wrote:
> Okay, something is wrong.  I've added in titles to a few PDF files,
> I can see the titles if I right-click in windows, and use the PDF tab,
> or open the file in Adobe Reader, and look at the document properties.

Here's how my mind works:


pdfinfo is used to find the title, so what does that show?

moseley@bumby:~$ pdfinfo 050819-securing-mac-os-x-tiger.pdf  | grep Title
Title:          Microsoft Word - 7 - Securing Mac OS X 10 4 Tiger v1.0.doc

Ok, so there's the title.  What does swish-filter-test do?

moseley@bumby:~$ swish-filter-test -content -quiet 050819-securing-mac-os-x-tiger.pdf  | grep title
<meta name="title" content="Microsoft Word - 7 - Securing Mac OS X 10 4 Tiger v1.0.doc">

Oh, no <title>.  That's odd.  Maybe some docs will help:


$ perldoc Pdf2HTML.pm 
Pdf2HTML(3)           User Contributed Perl Documentation          Pdf2HTML(3)



NAME
       SWISH::Filters::Pdf2HTML - Perl extension for filtering PDF documents with Swish-e

DESCRIPTION
       This is a plug-in module that uses the xpdf package to convert PDF documents to html for
       indexing by Swish-e.  Any info tags found in the PDF document are created as meta tags.

       This filter plug-in requires the xpdf package available at:

           http://www.foolabs.com/xpdf/

       You may pass into SWISH::Filter's new method a tag to use as the html <title> if found in
       the PDF info tags:

           my %user_data;
           $user_data{pdf}{title_tag} = 'title';

           $was_filtered = $filter->filter(
               document  => $filename,
               user_data => \%user_data,
           );

       Then if a PDF info tag of "title" is found that will be used as the HTML <title>.

Ah, so I need to set the user data title tag.  (I wonder why that's
not the default??)  Those docs are a bit out of date, too: the method
you call now is "convert", not "filter".

There are a few "fixes":

One would be to alias 'title' to swishtitle:

$ cat c
PropertyNameAlias swishtitle title

$  /usr/local/lib/swish-e/spider.pl default file:///home/moseley/050819-securing-mac-os-x-tiger.pdf | swish-e -c c -S prog -i stdin -v0 -T properties
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'

Summary for: file:///home/moseley/050819-securing-mac-os-x-tiger.pdf
         Connection: Close:      1  (1.0/sec)
               Total Bytes: 90,623  (90623.0/sec)
                Total Docs:      1  (1.0/sec)
               Unique URLs:      1  (1.0/sec)
application/pdf->text/html:      1  (1.0/sec)
          swishdocpath: 6 ( 55) S: "file:///home/moseley/050819-securing-mac-os-x-tiger.pdf"
            swishtitle: 7 ( 58) S: "Microsoft Word - 7 - Securing Mac OS X 10 4 Tiger v1.0.doc"
          swishdocsize: 8 (  4) N: "90623"
     swishlastmodified: 9 (  4) D: "2005-09-10 15:32:47 PDT"

That works.  Notice the swishtitle?

$ swish-e -w the -x 'Title = [%t]\n' -H0
Title = [Microsoft Word - 7 - Securing Mac OS X 10 4 Tiger v1.0.doc]


Another way would be to do what the docs say and set up the spider to
set the user data.  Or maybe just modify spider.pl's default settings
if that's what you are using:

        my $doc = $filter->convert(
            document     => $content_ref,
            name         => $response->base,
            content_type => $content_type,
            user_data    => {
                pdf => {
                    title_tag   => 'title',
                },
            },
        );


Or another way would be to update Pdf2HTML.pm so that it falls back
to 'title' when no title_tag is passed in:

    my $title_tag = $user_data->{pdf}{title_tag} if ref $user_data eq 'HASH';
    $title_tag ||= 'title';  # add this line
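
To see what that fallback does in isolation, here is a minimal
standalone sketch.  The helper name pick_title_tag is hypothetical (it
is not part of SWISH::Filters::Pdf2HTML), and it only assumes the
user_data layout shown in the perldoc above.  It also declares the
variable before the conditional assignment, since perlsyn documents
"my $x ... if ..." as having undefined behaviour:

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: decide which PDF info tag
# should become the HTML <title>, falling back to 'title' when the
# caller passed no user data or no pdf/title_tag entry.
sub pick_title_tag {
    my ($user_data) = @_;

    my $title_tag;    # declared separately from the conditional assignment
    $title_tag = $user_data->{pdf}{title_tag} if ref $user_data eq 'HASH';

    return $title_tag || 'title';
}

print pick_title_tag(undef), "\n";                                  # title
print pick_title_tag({}), "\n";                                     # title
print pick_title_tag({ pdf => { title_tag => 'subject' } }), "\n";  # subject
```

With a default like this, callers that never set user_data still get a
<title> from the PDF's "title" info tag, while anyone who sets
title_tag explicitly keeps their own choice.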

So you have a few options to pick from.




> Bill said that using the title was the default behavior.  Is there
> someplace else perhaps that I need to toggle a bit?

Sorry, I didn't look carefully enough.  If only I had read the docs
better.

-- 
Bill Moseley
moseley@hank.org

Received on Mon Nov 28 11:34:07 2005