On Tue, 11 Feb 2003, Michael REMY wrote:
That doc2txt.pm module is supposed to be called from another script --
such as the spider.pl program. For example, spider.pl fetches a
document, sees that it's an MS Word doc (from the content type), and then
uses the doc2txt.pm *module* to convert it to text and return the
converted document.
doc2txt.pm can be used two ways. One way is to pass in a reference to
a doc (the MS Word doc is already in memory, as is the case for
spider.pl); it then simply returns the converted doc as a reference to a
scalar. To avoid a double-ended pipe, doc2txt.pm writes the in-memory
document out to a temp file instead of piping it directly to catdoc.
(That's probably not necessary.)
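
As a rough sketch of that first mode (this is not the actual doc2txt.pm
source; the name convert_in_memory and the optional converter argument
are made up for illustration), the temp-file approach might look
something like this:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper, NOT the real doc2txt.pm code.
# Takes a reference to a scalar holding the raw document and returns a
# reference to a scalar holding the converted text.
sub convert_in_memory {
    my ($doc_ref, @converter) = @_;

    # Assumption: catdoc -a is the converter unless the caller says otherwise.
    @converter = ('catdoc', '-a') unless @converter;

    # Write the in-memory document to a temp file so we only need to
    # read from the converter, never write to it at the same time.
    my ($fh, $tmpname) = tempfile(UNLINK => 1);
    binmode $fh;
    print {$fh} $$doc_ref;
    close $fh or die "close $tmpname: $!";

    # List-form piped open: no shell involved, so no quoting problems.
    open(my $pipe, '-|', @converter, $tmpname)
        or die "cannot run $converter[0]: $!";
    local $/;                 # slurp the whole output
    my $text = <$pipe>;
    close $pipe;
    unlink $tmpname;

    return \$text;
}
```

A caller that already has the document in $content would then do
something like: my $text_ref = convert_in_memory(\$content);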
The other way to use doc2txt.pm is to pass in a file name. For
example, if you are scanning a local directory of files, when you see a
.doc file you can pass the file name to doc2txt.pm and it will return not
only the converted file, but also the headers required for use with
swish-e's -S prog. Indeed, doc2txt.pm is designed only for use with -S
prog input, not as a standalone program to convert MS Word files.
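
To give an idea of what "the headers required for -S prog" means, here
is a minimal sketch. The sub name prog_record is hypothetical, and the
header names (Path-Name, Content-Length, Last-Mtime) are the ones I
recall from the swish-e -S prog documentation, so check them against
the docs for your version:

```perl
use strict;
use warnings;

# Hypothetical helper: builds one record in the form a -S prog program
# prints to STDOUT, i.e. a few headers, a blank line, then the document.
# Assumes $$text_ref is a byte string, so length() is a byte count.
sub prog_record {
    my ($path, $text_ref, $mtime) = @_;

    my $header = "Path-Name: $path\n"
               . "Content-Length: " . length($$text_ref) . "\n";

    # The modification time is optional here; only add it when supplied.
    $header .= "Last-Mtime: $mtime\n" if defined $mtime;

    # A blank line separates the headers from the content.
    return $header . "\n" . $$text_ref;
}
```

A -S prog program would then just print each record, for example:
print prog_record($file, $text_ref, (stat $file)[9]);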
> to solve my problem with the doc2txt.pm, i had been done to add this lines
> before the doc2txt sub in your script :
>
> my $file = shift || die "Usage: $0 <filename>\n";
> system("catdoc -a $file > /tmp/toto.txt");
> system("cat /tmp/toto.txt");
> system("unlink /tmp/toto.txt");
>
> sub doc2txt {........etc.
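Well, I don't understand that. You are using system() when you should be
using backticks. system() returns the command's exit status, not its
output, so with backticks there is no need for the temp file: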
my $doc = `catdoc -a $file`;
See perldoc perlfaq8:
"Why can't I get the output of a command with system()?"
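
A small sketch of the difference, using echo as a stand-in for catdoc
so it runs anywhere. The capture_output sub is a made-up helper, not
part of doc2txt.pm or swish-e:

```perl
use strict;
use warnings;

# system() returns the wait status (0 on success); the command's output
# goes straight to our STDOUT and the script never sees it.
my $status = system('echo', 'hello');

# Backticks run the command and return whatever it wrote to STDOUT.
my $captured = `echo hello`;

# Hypothetical helper: same idea as backticks, but the list form of a
# piped open bypasses the shell, so odd file names need no quoting.
sub capture_output {
    my @cmd = @_;
    open(my $pipe, '-|', @cmd) or die "cannot run $cmd[0]: $!";
    local $/;                 # slurp everything at once
    my $out = <$pipe>;
    close $pipe;
    return $out;
}
```

With that helper the conversion would be something like:
my $doc = capture_output('catdoc', '-a', $file);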
> swish-e -cind_138.conf -l -v 3 -T
(Note: -T requires a parameter.)
You seem to want to convert that module into a program that just converts
MS Word files, or so I assume. The doc2txt.pm module is used for -S prog
programs and you are not using it as such (no -S prog in your command
above).
If your goal is to convert .doc files to text, then again all you need to
do is use a FileFilter entry:
From the example in the documentation:
FileFilter .doc /usr/local/bin/catdoc "-s8859-1 -d8859-1 '%p'"
You don't need a perl program to help with that, and using a perl program
will just slow indexing down.
--
Bill Moseley moseley@hank.org
Received on Tue Feb 11 15:24:06 2003