I started using swish-e yesterday, and first impressions are very favourable.
Following Josh Rabinowitz's 'How to Index Anything', I was able to index HTML and PDF files and then configure swish.cgi to get a web search form.
I'd now like to do the same for Word docs.
Using Josh's howto-doc-prog.pl as a starting point:
#!/usr/bin/perl -w
use pdf2xml;
my @files = `find ./pdf/ -name '*.pdf' -print`;
for (@files) {
    chomp();
    my $xml_record_ref = pdf2xml($_);
    # this is one XML file with a SWISH-E header
    print $$xml_record_ref;
}
I've tried to build an equivalent for Word docs and came up with:
#!/usr/bin/perl -w
my @files = `find ./doc/ -name '*.doc' -print`;
for (@files) {
    chomp();
    my $xml_record_ref = exec "/usr/local/bin/catdoc $_";
    # this is one XML file with a SWISH-E header
    print $$xml_record_ref;
}
which, when I run it, gives:
100:sparrow.djl.co.uk{david}% swish-e -c howto-doc.conf -S prog
Warning: UseStemming is deprecated. See FuzzyIndexingMode in the docs
Indexing Data Source: "External-Program"
Indexing "./howto-doc-prog.pl"
External Program found: ./howto-doc-prog.pl
Warning: Unknown header line: 'Cricket Roundup' from program ./howto-doc-prog.plerr: External program failed to return required headers Path-Name:
.
101:sparrow.djl.co.uk{david}%
I guess the problem is that catdoc produces plain text and not XML with a SWISH-E header.
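Looking at it again, I suspect two things are going on. Perl's exec replaces the running script and never returns, so catdoc's raw output goes straight to swish-e, which would explain 'Cricket Roundup' being read as a header line. And as far as I can tell from the SWISH-RUN documentation, a -S prog program has to print Path-Name: and Content-Length: headers followed by a blank line before each document. Below is a minimal sketch of what I think the script should look like. The swish_record helper is my own name, and the Document-Type: TXT* header is my assumption about how to tell swish-e that the content is plain text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build one record in the form I believe swish-e expects from a -S prog
# program: header lines, a blank line, then Content-Length bytes of content.
# (swish_record is a hypothetical helper name, not part of swish-e.)
sub swish_record {
    my ($path, $content) = @_;
    return "Path-Name: $path\n"
         . "Content-Length: " . length($content) . "\n"
         . "Document-Type: TXT*\n"    # assumption: selects the text parser
         . "\n"
         . $content;
}

# Same find command as before; skipped if ./doc does not exist.
my @files = -d './doc' ? `find ./doc/ -name '*.doc' -print` : ();

for my $file (@files) {
    chomp $file;
    # Capture catdoc's output instead of exec'ing it. The list form of
    # open bypasses the shell, so filenames with spaces are passed intact.
    open(my $fh, '-|', '/usr/local/bin/catdoc', $file)
        or do { warn "cannot run catdoc on $file: $!\n"; next };
    my $text = do { local $/; <$fh> };    # slurp all of catdoc's output
    close($fh);
    next unless defined $text && length $text;
    print swish_record($file, $text);
}
```

The length is taken on the undecoded output read from the pipe, so it should be a byte count, which is what Content-Length needs to be.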
Do I need to modify howto-doc.conf?
I currently have:
# howto-doc.conf
IndexDir ./howto-doc-prog.pl
IndexFile ./howto-doc.index
UseStemming yes
MetaNames swishtitle swishdocpath
Any ideas?
Thanks
Received on Fri Dec 2 03:30:02 2005