I have been trying to use the pdf2xml program without success. I am
running SWISH-E and copied a program (the first item immediately below) from
www.linuxjournal.com/articles.php?sid=6652, which uses pdf2xml to index
PDF files. SWISH-E itself works properly from both a browser and the
command line when indexing regular text files.
(howto-pdf-prog.pl file)
#!/usr/bin/perl -w
use pdf2xml;
my @files = system ('find /var/www/html/ccsp/docs/ -name *.pdf -print');
# system ('find /var/www/html/ccsp/docs/ -name *.pdf > /var/www/html/ccsp/docs/results.file');
for (@files) {
chomp();
my $xml_record_ref = pdf2xml($_);
print $$xml_record_ref;
}
The following is the configuration file I am using:
(howto-pdf-conf file)
IndexDir ./howto-pdf-prog.pl
# prog file to hand us XML docs
IndexFile ./howto-pdf.index
# Index to create
UseStemming yes
MetaNames swishtitle swishdocpath+
When executed, the following is the result:
[root@DPA2 ccsp]# swish-e -c howto-pdf.conf -S prog
Indexing Data Source: "External-Program"
Indexing "./howto-pdf-prog.pl"
Error: Couldn't open file '65280'
./howto-pdf-prog.pl: Failed close on pipe to pdfinfo for 65280: 256 at pdf2xml.pm line 129.
Removing very common words...
no words removed.
Writing main index...
err: No unique words indexed!
My tech support and I cannot figure out what the file '65280' is. There is
no file with that name anywhere on the server, and it is not one of the PDF
files in our test directory (../ccsp/docs/). We are at a loss as to what to
do next.
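
One possible reading, offered as a sketch rather than a confirmed diagnosis:
Perl's system() returns the child's wait status (the exit code shifted left
by 8 bits), not the command's output, so @files in the program above would
hold a single number. 65280 is what an exit code of 255 looks like in that
form. The snippet below demonstrates this on a Unix-like system with sh
available, and shows one way to capture the output of find through a pipe
instead. The directory '.' and the variable names are placeholders, not part
of the Linux Journal program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# system() returns the wait status (exit code << 8), not the output.
# Here the shell is asked to exit with code 255 to show the encoding.
my $status = system('sh', '-c', 'exit 255');
printf "system() returned %d (exit code %d)\n", $status, $status >> 8;
# prints: system() returned 65280 (exit code 255)

# To capture the file names, read find's output from a pipe.
# The list form of open bypasses the shell, so '*.pdf' reaches
# find unexpanded instead of being globbed in the current directory.
my $dir = '.';    # assumption: replace with the real document directory
open(my $find, '-|', 'find', $dir, '-name', '*.pdf', '-print')
    or die "Cannot run find: $!";
chomp(my @files = <$find>);
close($find) or warn "find exited with status $?";

print "$_\n" for @files;
```

If that reading is right, each element of @files would then be a path that
could be handed to pdf2xml, rather than a status number.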
Has anyone else experienced a similar problem and can help? Thank you for
any assistance you can provide.
Wayne Schomaker
303-239-4394
Received on Wed Nov 12 20:13:29 2003