Re: Creating pdf index

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Mar 16 2004 - 23:42:09 GMT
On Tue, Mar 16, 2004 at 03:20:41PM -0800, Lung.Allen wrote:
> 
> 
> What does running:
> 
>   ./howto-pdf-prog.pl | head
> 
> show?
> 
>  ./howto-pdf-prog.pl | head
> SCALAR(0x80779a8)SCALAR(0x80a91d4)SCALAR(0x804ca20)SCALAR(0x80a5b10)SCALAR(0x80a9204)SCALAR(0x80a92dc)SCALAR(0x80a

Looks like your program is printing the references that pdf2xml() returns
rather than the XML they point to.
Do you have any experience with Perl?
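
Output of the form SCALAR(0x...) is how Perl stringifies a scalar reference, so it usually means a reference was printed without a dereference. Here is a minimal sketch of the difference. Note that fake_pdf2xml() and example.pdf are made up for illustration; the only thing assumed about the real pdf2xml() is that it returns a reference to a scalar, as the script further down implies.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for pdf2xml(): builds a string and
# returns a reference to it (not the string itself).
sub fake_pdf2xml {
    my ($path) = @_;
    my $xml = "<doc path=\"$path\"/>\n";
    return \$xml;
}

my $ref = fake_pdf2xml('example.pdf');

my $wrong = "$ref";   # the reference stringified: SCALAR(0x...)
my $right = $$ref;    # the string the reference points to

print "wrong: $wrong\n";
print "right: $right";
```

If your script has "print $xml_record_ref;" where it should have "print $$xml_record_ref;", that would explain what you are seeing.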

I copied this from the Linux article (adding the use lib path) and it seems to work ok.
Is this the script you are using?

moseley@bumby:~$ cat x.pl
#!/usr/bin/perl -w
use lib '/home/moseley/123/swish-e-2.4.1/prog-bin';
use pdf2xml;


my @files =
    `find /usr/share/cups/doc-root -name 'i*.pdf' -print`;
for (@files) {
    chomp();
    my $xml_record_ref = pdf2xml($_);
    # this is one XML file with a SWISH-E header
    print $$xml_record_ref;
}

moseley@bumby:~$ perl x.pl | swish-e -S prog -i stdin  -T properties
Indexing Data Source: "External-Program"
Indexing "stdin"
          swishdocpath: 6 ( 35) S: "/usr/share/cups/doc-root/ja/idd.pdf"
            swishtitle: 7 ( 33) S: "CUPS Interface Design Description"
          swishdocsize: 8 (  4) N: "43274"
     swishlastmodified: 9 (  4) D: "2003-11-14 08:31:42 PST"
          swishdocpath: 6 ( 35) S: "/usr/share/cups/doc-root/ja/ipp.pdf"
            swishtitle: 7 ( 26) S: "CUPS Implementation of IPP"
          swishdocsize: 8 (  4) N: "70011"
     swishlastmodified: 9 (  4) D: "2003-11-14 08:31:43 PST"
          swishdocpath: 6 ( 32) S: "/usr/share/cups/doc-root/idd.pdf"
            swishtitle: 7 ( 33) S: "CUPS Interface Design Description"
          swishdocsize: 8 (  4) N: "43274"
     swishlastmodified: 9 (  4) D: "2004-03-05 04:00:48 PST"
          swishdocpath: 6 ( 32) S: "/usr/share/cups/doc-root/ipp.pdf"
            swishtitle: 7 ( 26) S: "CUPS Implementation of IPP"
          swishdocsize: 8 (  4) N: "70011"
     swishlastmodified: 9 (  4) D: "2004-03-05 04:00:48 PST"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 1,564 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
1,564 unique words indexed.
4 properties sorted.                                              
4 files indexed.  226,570 total bytes.  30,530 total words.
Elapsed time: 00:00:02 CPU time: 00:00:01
Indexing done!

-- 
Bill Moseley
moseley@hank.org
Received on Tue Mar 16 15:42:09 2004