Bill Moseley wrote:
> On Mon, Nov 28, 2005 at 08:39:38AM -0800, J. David Boyd wrote:
>
>>Okay, something is wrong. I've added titles to a few PDF files.
>>I can see the titles if I right-click in Windows and use the PDF tab,
>>or open the file in Adobe Reader and look at the document properties.
>
>
> Here's how my mind works:
>
>
> pdfinfo is used to find the title, so what does that show?
>
> moseley@bumby:~$ pdfinfo 050819-securing-mac-os-x-tiger.pdf | grep Title
> Title: Microsoft Word - 7 - Securing Mac OS X 10 4 Tiger v1.0.doc
>
> Ok, so there's the title. What does swish-filter-test do?
>
> moseley@bumby:~$ swish-filter-test -content -quiet 050819-securing-mac-os-x-tiger.pdf | grep title
> <meta name="title" content="Microsoft Word - 7 - Securing Mac OS X 10 4 Tiger v1.0.doc">
>
> Oh, no <title>. That's odd. Maybe some docs will help:
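
The check above can be wrapped in a small shell function. This is only a sketch: title_source is a made-up helper name, not part of swish-e, and it assumes the filter's HTML output is piped in on stdin.

```shell
# Hypothetical helper: classify where a title appears in HTML read from stdin.
# Prints "element" if a <title> element is present, "meta" if only a
# <meta name="title" ...> tag is present, and "none" otherwise.
title_source() {
    html=$(cat)
    if printf '%s\n' "$html" | grep -qi '<title>'; then
        echo element
    elif printf '%s\n' "$html" | grep -qi '<meta name="title"'; then
        echo meta
    else
        echo none
    fi
}
```

Piping the swish-filter-test output shown above into title_source should print "meta", since that output carries the meta tag but no <title> element.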
>
>
> $ perldoc Pdf2HTML.pm
> Pdf2HTML(3) User Contributed Perl Documentation Pdf2HTML(3)
>
>
>
> NAME
> SWISH::Filters::Pdf2HTML - Perl extension for filtering PDF documents with Swish-e
>
> DESCRIPTION
> This is a plug-in module that uses the xpdf package to convert PDF documents to html for
> indexing by Swish-e. Any info tags found in the PDF document are created as meta tags.
>
> This filter plug-in requires the xpdf package available at:
>
> http://www.foolabs.com/xpdf/
>
> You may pass into SWISH::Filter's new method a tag to use as the html <title> if found in
> the PDF info tags:
>
> my %user_data;
> $user_data{pdf}{title_tag} = 'title';
>
> $was_filtered = $filter->filter(
>     document  => $filename,
>     user_data => \%user_data,
> );
>
> Then if a PDF info tag of "title" is found that will be used as the HTML <title>.
>
> Ah, so I need to set the user data title tag. (I wonder why that's
> not the default?) Those docs are a bit old, too: you now call
> "convert", not "filter".
>
> There are a few "fixes":
>
> One would be to alias 'title' to swishtitle:
>
> $ cat c
> PropertyNameAlias swishtitle title
>
> $ /usr/local/lib/swish-e/spider.pl default file:///home/moseley/050819-securing-mac-os-x-tiger.pdf | swish-e -c c -S prog -i stdin -v0 -T properties
> /usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'
>
> Summary for: file:///home/moseley/050819-securing-mac-os-x-tiger.pdf
> Connection: Close: 1 (1.0/sec)
> Total Bytes: 90,623 (90623.0/sec)
> Total Docs: 1 (1.0/sec)
> Unique URLs: 1 (1.0/sec)
> application/pdf->text/html: 1 (1.0/sec)
> swishdocpath: 6 ( 55) S: "file:///home/moseley/050819-securing-mac-os-x-tiger.pdf"
> swishtitle: 7 ( 58) S: "Microsoft Word - 7 - Securing Mac OS X 10 4 Tiger v1.0.doc"
> swishdocsize: 8 ( 4) N: "90623"
> swishlastmodified: 9 ( 4) D: "2005-09-10 15:32:47 PDT"
>
> That works. Notice the swishtitle?
>
> $ swish-e -w the -x 'Title = [%t]\n' -H0
> Title = [Microsoft Word - 7 - Securing Mac OS X 10 4 Tiger v1.0.doc]
>
>
> Another way would be to do what the docs say and set up the spider to
> set the user data. Or maybe just modify spider.pl's default settings
> if that's what you are using:
>
> my $doc = $filter->convert(
>     document     => $content_ref,
>     name         => $response->base,
>     content_type => $content_type,
>     user_data    => {
>         pdf => {
>             title_tag => 'title',
>         },
>     },
> );
>
>
> Or another way would be to update Pdf2HTML.pm so that 'title' is the default:
>
> my $title_tag = $user_data->{pdf}{title_tag} if ref $user_data eq 'HASH';
> $title_tag ||= 'title'; # add this line
>
> So you have a few options to pick from.
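
To make the end result of those options concrete, here is a rough shell sketch of the effect: the value of the "title" meta tag ends up in a <title> element. It is not how Pdf2HTML.pm does it internally. meta_to_title is a hypothetical name, and the pattern assumes the attribute order and double quoting seen in the swish-filter-test output above.

```shell
# Hypothetical helper: read HTML on stdin and print a <title> element built
# from the first <meta name="title" content="..."> tag, if there is one.
# Prints nothing when no such meta tag is found.
meta_to_title() {
    sed -n 's/.*<meta name="title" content="\([^"]*\)">.*/<title>\1<\/title>/p' | head -n 1
}
```

For example, feeding it a line such as <meta name="title" content="Some Title"> prints <title>Some Title</title>.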
>
>
>
>
>
>>Bill said that using the title was the default behavior. Is there
>>someplace else perhaps that I need to toggle a bit?
>
>
> Sorry, I didn't look carefully enough -- if only I had read the
> docs better.
>
Just wanted to say thanks very much! Everything is working great now.
Dave in Largo, FL
Received on Thu Dec 1 05:23:52 2005