Re: indexing Msoft Word docs

From: David Larkin <david.larkin(at)not-real.djl.co.uk>
Date: Fri Dec 02 2005 - 12:27:41 GMT
On Fri, 2 Dec 2005 03:29:47 -0800 (PST)
David Larkin <david.larkin@djl.co.uk> wrote:



I found doc2txt.pm, and this works:


#!/usr/bin/perl -w
use doc2txt;

my @files =
    `find ./doc/ -name '*.doc' -print`;
for (@files) {
    chomp();
#    my $xml_record_ref = exec "/usr/local/bin/catdoc $_";
    my $xml_record_ref = doc2txt($_);
    # this is one XML file with a SWISH-E header
    print $$xml_record_ref;
}

;-)

I guess I should really rename the $xml_record_ref variable, but I can now search using swish.cgi, which is great.
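
For anyone who would rather keep catdoc: as far as I can tell, the catdoc attempt quoted below failed for two reasons. Perl's exec replaces the running script with catdoc, so nothing after that line ever runs, and catdoc's plain text goes straight to swish-e, which then tries to read the first line ('Cricket Roundup') as a header. With -S prog, swish-e expects each document to arrive as header lines (at least Path-Name and Content-Length), a blank line, and then the content. Below is a minimal sketch of doing that by hand. The format_record helper is a made-up name for illustration, the catdoc path and ./doc/ directory are the ones from this thread, and the Document-Type line is optional and assumes you want the plain-text parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build one record in the layout swish-e reads from a
# -S prog program: header lines, a blank line, then the document content.
# Content-Length must be the size in bytes, so $content is assumed to be
# a byte string (which is what reading from a pipe gives by default).
sub format_record {
    my ($path, $content) = @_;
    return "Path-Name: $path\n"
         . "Content-Length: " . length($content) . "\n"
         . "Document-Type: TXT*\n"   # optional: ask for the text parser
         . "\n"
         . $content;
}

# Assumed locations, copied from the scripts in this thread.
my $catdoc = '/usr/local/bin/catdoc';
my @files  = `find ./doc/ -name '*.doc' -print`;

for my $file (@files) {
    chomp $file;

    # Capture catdoc's output instead of exec'ing it. The list form of
    # open avoids shell quoting problems with odd file names.
    my $fh;
    unless (open($fh, '-|', $catdoc, $file)) {
        warn "cannot run $catdoc on $file: $!\n";
        next;
    }
    local $/;               # slurp mode: read the whole output at once
    my $text = <$fh>;
    close $fh;

    # Skip files that produced no text at all.
    next unless defined $text && length $text;

    print format_record($file, $text);
}
```

The headers go in front of every document, not just once at the top of the output, which is why the helper is called inside the loop.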


> I started using swish-e yesterday, and first impressions are very favourable.
> 
> Following Josh Rabinowitz' 'How to Index anything', I was able to index HTML and PDF files and then configure swish.cgi to get a web search form.
> 
> I'd now like to do the same for Word docs.
> 
> Using Josh's howto-doc-prog.pl as a starting point

I meant to say howto-doc-prog.pl

> 
> #!/usr/bin/perl -w
> use pdf2xml;
> my @files =
>     `find ./pdf/ -name '*.pdf' -print`;
> for (@files) {
>     chomp();
>     my $xml_record_ref = pdf2xml($_);
>     # this is one XML file with a SWISH-E header
>     print $$xml_record_ref;
> }
> 
> I've tried to build an equivalent for Word docs, and came up with
> 
> #!/usr/bin/perl -w
> 
> my @files =
>     `find ./doc/ -name '*.doc' -print`;
> for (@files) {
>     chomp();
>     my $xml_record_ref = exec "/usr/local/bin/catdoc $_";
>     # this is one XML file with a SWISH-E header
>     print $$xml_record_ref;
> }
> 
> which when I run gives
> 
> 100:sparrow.djl.co.uk{david}% swish-e -c howto-doc.conf -S prog
> 
> Warning: UseStemming is deprecated.  See FuzzyIndexingMode in the docs
> Indexing Data Source: "External-Program"
> Indexing "./howto-doc-prog.pl"
> External Program found: ./howto-doc-prog.pl
> 
> Warning: Unknown header line: 'Cricket Roundup' from program ./howto-doc-prog.plerr: External program failed to return required headers Path-Name:
> .
> 101:sparrow.djl.co.uk{david}%
> 
> 
> I guess the problem is that catdoc produces plain text and not XML.
> 
> Do I need to modify howto-doc.conf?
> 
> I currently have
> 
> # howto-doc.conf
> 
> IndexDir ./howto-doc-prog.pl
> 
> IndexFile ./howto-doc.index
> 
> UseStemming	yes
> MetaNames	swishtitle	swishdocpath
> 
> Any ideas?
> 
> Thanks
Received on Fri Dec 2 04:27:47 2005