Re: Using a translated link for the 'found' hyperlink,

From: J. David Boyd <david(at)not-real.adboyd.com>
Date: Thu Dec 01 2005 - 13:23:52 GMT

Bill Moseley wrote:
> On Mon, Nov 28, 2005 at 08:39:38AM -0800, J. David Boyd wrote:
> 
>>Okay, something is wrong.  I've added in titles to a few PDF files,
>>I can see the titles if I right-click in windows, and use the PDF tab,
>>or open the file in Adobe Reader, and look at the document properties.
> 
> 
> Here's how my mind works:
> 
> 
> pdfinfo is used to find the title, so what does that show?
> 
> moseley@bumby:~$ pdfinfo 050819-securing-mac-os-x-tiger.pdf  | grep Title
> Title:          Microsoft Word - 7 - Securing Mac OS X 10 4 Tiger v1.0.doc
> 
> Ok, so there's the title.  What does swish-filter-test do?
> 
> moseley@bumby:~$ swish-filter-test -content -quiet 050819-securing-mac-os-x-tiger.pdf  | grep title
> <meta name="title" content="Microsoft Word - 7 - Securing Mac OS X 10 4 Tiger v1.0.doc">
> 
> Oh, no <title>.  That's odd.  Maybe some docs will help:
> 
> 
> $ perldoc Pdf2HTML.pm 
> Pdf2HTML(3)           User Contributed Perl Documentation          Pdf2HTML(3)
> 
> 
> 
> NAME
>        SWISH::Filters::Pdf2HTML - Perl extension for filtering PDF documents with Swish-e
> 
> DESCRIPTION
>        This is a plug-in module that uses the xpdf package to convert PDF documents to html for
>        indexing by Swish-e.  Any info tags found in the PDF document are created as meta tags.
> 
>        This filter plug-in requires the xpdf package available at:
> 
>            http://www.foolabs.com/xpdf/
> 
>        You may pass into SWISH::Filter's new method a tag to use as the html <title> if found in
>        the PDF info tags:
> 
>            my %user_data;
>            $user_data{pdf}{title_tag} = 'title';
> 
>            $was_filtered = $filter->filter(
>                document  => $filename,
>                user_data => \%user_data,
>            );
> 
>        Then if a PDF info tag of "title" is found that will be used as the HTML <title>.
> 
> Ah, so I need to set the user data title tag.  (I wonder why that's
> not the default??)  Those docs are a bit old, too, since you now call
> "convert", not "filter".
> 
> There are a few "fixes":
> 
> One would be to alias 'title' to swishtitle:
> 
> $ cat c
> PropertyNameAlias swishtitle title
> 
> $  /usr/local/lib/swish-e/spider.pl default file:///home/moseley/050819-securing-mac-os-x-tiger.pdf | swish-e -c c -S prog -i stdin -v0 -T properties
> /usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'
> 
> Summary for: file:///home/moseley/050819-securing-mac-os-x-tiger.pdf
>          Connection: Close:      1  (1.0/sec)
>                Total Bytes: 90,623  (90623.0/sec)
>                 Total Docs:      1  (1.0/sec)
>                Unique URLs:      1  (1.0/sec)
> application/pdf->text/html:      1  (1.0/sec)
>           swishdocpath: 6 ( 55) S: "file:///home/moseley/050819-securing-mac-os-x-tiger.pdf"
>             swishtitle: 7 ( 58) S: "Microsoft Word - 7 - Securing Mac OS X 10 4 Tiger v1.0.doc"
>           swishdocsize: 8 (  4) N: "90623"
>      swishlastmodified: 9 (  4) D: "2005-09-10 15:32:47 PDT"
> 
> That works.  Notice the swishtitle?
> 
> $ swish-e -w the -x 'Title = [%t]\n' -H0
> Title = [Microsoft Word - 7 - Securing Mac OS X 10 4 Tiger v1.0.doc]
> 
> 
> Another way would be to do what the docs say and set up the spider to
> pass the user data.  Or maybe just modify spider.pl's default settings
> if that's what you are using:
> 
>         my $doc = $filter->convert(
>             document     => $content_ref,
>             name         => $response->base,
>             content_type => $content_type,
>             user_data    => {
>                 pdf => {
>                     title_tag   => 'title',
>                 },
>             },
>         );
> 
> 
> Or another way would be to update Pdf2HTML.pm
> 
>     my $title_tag = $user_data->{pdf}{title_tag} if ref $user_data eq 'HASH';
>     $title_tag ||= 'title';  # add this line
> 
> So you have a few options to pick from.
> 
>>Bill said that using the title was the default behavior.  Is there
>>someplace else perhaps that I need to toggle a bit?
> 
> 
> Sorry, I didn't look carefully enough -- if only I had read the
> docs more closely.
> 

Just wanted to say thanks very much!  Everything is working great now.
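
For the archive, here is a rough sketch of the check Bill walked through
above: does the converted HTML carry a real <title> element, or only a
<meta name="title"> tag?  The function name title_source is made up for
illustration and is not part of swish-e; the sample line is copied from
the swish-filter-test output quoted above.

```shell
#!/bin/sh
# Sketch only: classify filter output read from stdin.
# title_source is a hypothetical helper, not a swish-e command.
title_source() {
    html=$(cat)
    if printf '%s\n' "$html" | grep -qi '<title>'; then
        # A real <title> element is present.
        echo "title-element"
    elif printf '%s\n' "$html" | grep -qi '<meta name="title"'; then
        # Only the meta tag is present, as in the output quoted above.
        echo "meta-only"
    else
        echo "none"
    fi
}

# Sample line copied from the swish-filter-test output in this thread.
sample='<meta name="title" content="Microsoft Word - 7 - Securing Mac OS X 10 4 Tiger v1.0.doc">'
printf '%s\n' "$sample" | title_source
```

On the sample line this prints "meta-only", which matches the symptom in
this thread.  Against a real file it could presumably be fed from the
same command Bill used, swish-filter-test -content -quiet FILE.pdf,
piped into title_source.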

Dave in Largo, FL
Received on Thu Dec 1 05:23:52 2005