Re: -b with multiple indexes.

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Oct 12 2001 - 20:44:55 GMT

I'm cc'ing the list, in case it helps anyone else.

At 11:17 AM 10/12/01 -0700, you wrote:
>I'm confused as to where I put the pdf2xml subroutine.  Does that need
>to go in the spider.pl file?

There are two Perl modules in the prog-bin directory:  pdf2xml.pm and
pdf2html.pm.  They basically do the same thing.  You don't have to put it
anyplace.  You just load the module from your spider config file.

use pdf2xml;

If the module is not in perl's @INC path you must tell perl where to look:

use lib '/home/gklass/swish-e/prog-bin';
use pdf2xml;

Now the function is imported.
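
If perl complains that it can't locate the module, a quick check like the one below can show whether the directory you passed to "use lib" really holds the file. This is only a sketch: the path is the one from the example above, and the rest is plain perl with nothing specific to swish.

```perl
use strict;
use warnings;

# Same directory as in the example above; adjust to your own install.
use lib '/home/gklass/swish-e/prog-bin';

my $module = 'pdf2xml.pm';

# @INC can hold code references as well as directory names, so skip those.
my @found = grep { -e "$_/$module" } grep { !ref } @INC;

print @found
    ? "Found $module in $found[0]\n"
    : "$module is not in any \@INC directory; check the 'use lib' path\n";
```

Run it with the same perl that runs spider.pl, since different perl installs have different @INC lists.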

In the SwishSpiderConfig.pl example you will see one of the configs showing

      filter_content  => [ \&pdf, \&doc ],


where &pdf and &doc are just subroutines also defined in that file.  Here's
one using pdf2html, but it's the same procedure for pdf2xml:

use pdf2html;  # included example pdf converter module
sub pdf {
   my ( $uri, $server, $response, $content_ref ) = @_;

   return 1 unless $response->content_type eq 'application/pdf';

   # for logging counts
   $server->{counts}{'PDF transformed'}++;

   $$content_ref = ${pdf2html( $content_ref, 'title' )};
   $$content_ref =~ tr/ / /s;
   return 1;
}

I think the tr/ / /s is silly and not needed (it just thins out white space
before sending to swish).
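
For the pdf2xml case, the filter might look roughly like the sketch below. To make it runnable on its own, pdf2xml() here is a stand-in stub and MockResponse is a hypothetical object that only knows content_type; neither is the real thing. The argument list passed to pdf2xml() is an assumption copied from the pdf2html call above, so check perldoc pdf2xml for what the module actually expects.

```perl
use strict;
use warnings;

# Stand-in for "use pdf2xml;" so this sketch runs without the module.
# ASSUMPTION: the call shape is copied from the pdf2html example above.
sub pdf2xml {
    my ( $content_ref, $title_tag ) = @_;
    my $xml = "<doc><$title_tag>stub title</$title_tag></doc>";
    return \$xml;
}

# Hypothetical mock of the response object; only content_type is needed.
package MockResponse;
sub new          { my ( $class, $type ) = @_; return bless { type => $type }, $class }
sub content_type { return $_[0]{type} }

package main;

sub pdf {
    my ( $uri, $server, $response, $content_ref ) = @_;

    # Leave anything that is not a PDF untouched.
    return 1 unless $response->content_type eq 'application/pdf';

    # for logging counts
    $server->{counts}{'PDF transformed'}++;

    # Replace the raw PDF bytes with the converted text.
    $$content_ref = ${ pdf2xml( $content_ref, 'title' ) };
    return 1;
}

my $server = {};

my $pdf_content = '%PDF-1.3 raw bytes';
pdf( 'http://localhost/a.pdf', $server, MockResponse->new('application/pdf'), \$pdf_content );

my $html_content = '<html>plain page</html>';
pdf( 'http://localhost/a.html', $server, MockResponse->new('text/html'), \$html_content );

print "$pdf_content\n";
print "$html_content\n";
print "PDFs transformed: $server->{counts}{'PDF transformed'}\n";
```

In a real config you would drop the stub and the mock, keep "use pdf2xml;" at the top, and list \&pdf in filter_content as shown earlier.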

From the prog-bin directory you can type perldoc pdf2html and perldoc
pdf2xml for more info.

pdf2xml was written first, then pdf2html was written so that we could trick
swish into finding and storing the title.  But, now swish can alias meta
and property names, so you can have a tag like <title> in xml, and tell
swish to index that as <swishdoctitle> which is what html titles are
indexed as.  That way xml and html results will both show a title.  You can
also just use <swishdoctitle> in xml to do the same thing.
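
If you go the second route and want the xml itself to carry <swishdoctitle>, one way is to rename the tag before the content is handed to swish. The sketch below is just an illustration: retitle() is a hypothetical helper, not part of any of the modules, and the substitution only handles a bare <title> tag with no attributes.

```perl
use strict;
use warnings;

# Hypothetical helper: rename <title>...</title> to
# <swishdoctitle>...</swishdoctitle> in place.
sub retitle {
    my ($xml_ref) = @_;

    # $1 is '/' for the closing tag and empty for the opening tag.
    $$xml_ref =~ s{<(/?)title>}{<${1}swishdoctitle>}g;
    return 1;
}

my $xml = '<doc><title>Annual Report</title><body>text</body></doc>';
retitle( \$xml );
print "$xml\n";
```

The same substitution could sit inside a filter sub like &pdf, right after the conversion step.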

Does that help?



Bill Moseley
mailto:moseley@hank.org
Received on Fri Oct 12 20:45:28 2001