Re: Using a translated link for the 'found' hyperlink,

From: Michael Peters <mpeters(at)>
Date: Tue Nov 22 2005 - 15:17:54 GMT
J. David Boyd wrote:
> Bill Moseley wrote:
>>On Mon, Nov 21, 2005 at 08:02:32AM -0800, J. David Boyd wrote:
>>>Is it possible to hook into the index generation code of swish-e, and
>>>insert my own translation code, such that when the indexer sees
>>>S2171_TABLE.pdf, I can look in my translation table, and stuff the value
>>>'Module 72 Tables' to be displayed in the hyperlink of the search
>>>result?  Of course, the hyperlink still has to point to the original file.
>>Easily.  It should already do that, but maybe you don't have titles in
>>your pdf docs.
>>Anyway, in SWISH::Filters::Pdf2HTML (assuming that's what you are
>>using), just set the title:
>>    $title ||= lookup_title( $file_name );
> I don't see any code that looks like what you have there.  I see code in
> sub filter() that sets a title.  Do I monkey around in there?  That
> looks, to me, like a good way to break something.

Bill was giving you a sample line to use, not code from the module.
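
To make that concrete, lookup_title() could be as small as a hash lookup.
This is only a sketch: lookup_title and %TITLE_FOR are hypothetical names,
not part of SWISH::Filters::Pdf2HTML, and the single table entry is the
example from earlier in this thread.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical translation table: file name => title to display.
my %TITLE_FOR = (
    'S2171_TABLE.pdf' => 'Module 72 Tables',
);

# Return the translated title for a file, or undef if the file is not
# in the table. Only the base name is used, so any directory is ignored.
sub lookup_title {
    my ($file_name) = @_;
    return unless defined $file_name;
    return $TITLE_FOR{ basename($file_name) };
}
```

Because it returns undef for unknown files, a line such as
`$title ||= lookup_title( $file_name );` leaves an existing title alone
and only fills in a title when the PDF did not supply one.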

> Now, as an alternative, I find that I can actually set a title in my PDF
> file, using pdftk.  It's kind of convoluted, but it works okay.

It's not really convoluted. The fact that swish-e can't find a title in your PDF
means that there isn't one. Giving your PDFs actual titles is a good way to make
sure their titles show up when searched for. If you need these titles to be
table-driven (as per your other email), you could write a simple Perl script that
queries the database and, for each PDF listed there, finds the file and uses
pdftk to give it the intended title.

Just make this script a part of whatever process you use to index your documents.
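
A rough sketch of such a script follows. Assumptions: the %PDF_TITLES hash
stands in for your database query, the helper names are made up for this
example, and the info-file layout (InfoBegin / InfoKey / InfoValue) matches
what recent pdftk releases print from dump_data. Older releases may use only
the InfoKey/InfoValue pair, so check the output of `pdftk your.pdf dump_data`
for the format your version expects. Titles are assumed to be plain ASCII.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for the rows returned by your database query.
my %PDF_TITLES = (
    'S2171_TABLE.pdf' => 'Module 72 Tables',
);

# Build the text that pdftk's update_info operation reads.
sub info_text {
    my ($title) = @_;
    return "InfoBegin\nInfoKey: Title\nInfoValue: $title\n";
}

# Write a copy of $in to $out with the given title.
# Returns true if pdftk exited successfully.
sub set_pdf_title {
    my ( $in, $out, $title ) = @_;
    my ( $fh, $info_file ) = tempfile( UNLINK => 1 );
    print {$fh} info_text($title);
    close $fh or die "Cannot write $info_file: $!";
    return system( 'pdftk', $in, 'update_info', $info_file,
        'output', $out ) == 0;
}

for my $file ( sort keys %PDF_TITLES ) {
    if ( !-e $file ) {
        warn "Skipping $file: not found\n";
        next;
    }
    # pdftk cannot write onto its own input, so write a temporary copy
    # and move it into place only if pdftk succeeded.
    my $tmp = "$file.tmp";
    if ( set_pdf_title( $file, $tmp, $PDF_TITLES{$file} ) ) {
        rename $tmp, $file or warn "Cannot replace $file: $!\n";
    }
    else {
        warn "pdftk failed for $file\n";
        unlink $tmp;
    }
}
```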

> I see that Pdf2HTML mentions that it can store the title, but it doesn't
> work by default.  (By which I mean that I have manually set some titles
> in my PDF files, run the index, perform a search, and it shows the file
> name as the hyperlink, rather than the PDF file's internal title)

It does work by default. It's just that your PDFs don't actually have titles.
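
One way to check this is to look at the PDF's info dictionary directly.
The sketch below assumes pdftk is installed and that its dump_data output
lists metadata as an "InfoKey: Title" line followed by an "InfoValue: ..."
line; title_from_dump is a name invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract the Title value from the text printed by `pdftk FILE dump_data`.
# Returns undef when there is no Title key or its value is empty.
sub title_from_dump {
    my ($dump) = @_;
    my @lines = split /\r?\n/, $dump;
    for my $i ( 0 .. $#lines - 1 ) {
        next unless $lines[$i] =~ /^InfoKey:\s*Title\s*$/;
        return $1 if $lines[ $i + 1 ] =~ /^InfoValue:\s*(.*\S)/;
    }
    return;
}

# Usage: perl check_title.pl file1.pdf file2.pdf ...
for my $pdf (@ARGV) {
    # List-form pipe open avoids shell quoting problems in file names.
    open my $ph, '-|', 'pdftk', $pdf, 'dump_data'
        or die "Cannot run pdftk: $!";
    my $dump = do { local $/; <$ph> };
    close $ph;
    my $title = title_from_dump( defined $dump ? $dump : '' );
    print "$pdf: ", ( defined $title ? $title : '(no title)' ), "\n";
}
```

If this reports "(no title)" for a file, the indexer has nothing to pick up
and will fall back to the file name.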

> ---------------------------------
> You may pass into SWISH::Filter's new method a tag to use as the html
> <title> if found in the PDF info tags:
>     my %user_data;
>     $user_data{pdf}{title_tag} = 'title';
>     $was_filtered = $filter->filter(
>         document  => $filename,
>         user_data => \%user_data,
>     );
> Then if a PDF info tag of "title" is found that will be used as the HTML
> <title>.
> ---------------------------------
> Does this mean that if I copy the actual code (skipping comments, of
> course), from the above quoted section, and place it into the sub new()
> function, that I will be adding in the ability to read the titles?  If
> so, where do I put it?  Before the return statement obviously (even to
> me), but does it go inside of bless(), before it, after it?

Doing the above won't magically fix your title problem. What the above means is
that after Pdf2HTML converts the PDF to an HTML doc, you don't necessarily
have to use the PDF's title (which becomes the HTML <title>) as the title in the
index. You could use something else if you wanted to (e.g., <h1>).

But before you go this route, I'd give your PDFs titles.

Michael Peters
Plus Three, LP
Received on Tue Nov 22 07:17:58 2005