Re: Indexing .doc .ppt .xls with filters and prog method

From: Benoit Guguin <liste(at)not-real.alixen.fr>
Date: Fri Aug 19 2005 - 12:58:17 GMT
So I tried another way to index the files:

In my swish.conf I use FileFilter and FileFilterMatch rules for all files, like this:


FileFilter .pdf /usr/bin/pdftotext "'%p' -"
IndexContents TXT .pdf

FileFilter .doc /usr/bin/catdoc "-s8859-1 -d8859-1 '%p'"
IndexContents TXT .doc

FileFilter .ppt /usr/bin/ppthtml "'%p'"
IndexContents HTML .html .ppt
StoreDescription HTML* <test:p> 20000


FileFilterMatch "/usr/bin/unzip" "-p \"%p\" content.xml" /\.(sxw|sxc|sxg|sxi)$/i
IndexContents XML* .xml .sxw .sxc .sxg .sxi
StoreDescription XML* <text:p> 20000

FileFilter .xls /usr/bin/xlhtml "'%p'"
IndexContents HTML .html .xls
StoreDescription HTML* <test:p> 20000
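
To check a rule like the ones above by hand, it can help to print the command line that results once %p is replaced by a document path, then run it in a terminal. This is only a minimal sketch: expand_filter is a hypothetical helper, not part of swish-e, and /tmp/sample.doc is a made-up path.

```shell
#!/bin/sh
# Hypothetical helper (not part of swish-e): substitute a document path for
# the %p placeholder in a filter option string and print the command line.
# Assumption: the path contains no '|' or '&', which sed would treat
# specially in the replacement text.
expand_filter() {
    prog=$1   # filter program, e.g. /usr/bin/catdoc
    opts=$2   # option template containing %p
    path=$3   # document to convert
    printf '%s %s\n' "$prog" "$(printf '%s' "$opts" | sed "s|%p|$path|g")"
}

# Print the catdoc command for a sample file (the path is illustrative only).
expand_filter /usr/bin/catdoc "-s8859-1 -d8859-1 '%p'" /tmp/sample.doc
```

If the printed command produces no text when run manually on a real file, the problem is more likely in the filter program than in the swish-e configuration.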


Benoit Guguin wrote:

>OK, thank you.
>
>I have tested with DirTree.pl and it works fine with xls, pdf and doc.
>
>So I'm now looking to add filters for PowerPoint and OpenOffice 
>(sxi, sxw, sxc), but I don't understand the source code :( ...
>
>If someone has already done this, could they share the file, please?
>
>
>Thanks again,
>
>Regards,
>
>Peter Karman wrote:
>
>>The .pm files:
>>
>> doc2txt.pm
>> pdf2html.pm
>> pdf2xml.pm
>>
>>are example modules that predate (iirc) the SWISH::Filters class. The reason 
>>pdf2html works in your script is this line in the pdf2html.pm file:
>>
>>  @EXPORT = qw(pdf2html);
>>
>>which tells Perl to import that function into your script's namespace when 
>>the script loads the module with 'use'.
>>
>>I'd suggest using the DirTree.pl example script instead; it calls SWISH::Filter 
>>for you correctly.
>>
>>Benoit Guguin scribbled on 8/19/05 4:45 AM:
>>
>>>Hello,
>>>
>>>I'm trying to index a directory containing only pdf, doc, xls and ppt files.
>>>
>>>
>>>I've seen that version 2.5.4 includes some Perl scripts to filter .ppt, .xls and .doc. 
>>>
>>>I tried to use them with the prog method, but when I run swish-e 
>>>("swish-e -c /etc/swish-e/swish.conf -S prog") I get these errors:
>>>
>>>Undefined subroutine &main::Doc2html called at /etc/swish-e/swish.pl 
>>>line 55.
>>>Or
>>>Undefined subroutine &main::pp2hml called at /etc/swish-e/swish.pl
>>>
>>>Which error I get depends on the order of the functions.
>>>
>>>
>>>So I don't understand why it works fine for pdf but not for the other 
>>>formats...
>>>
>>>I've searched the mailing-list archive but haven't found my Holy Grail ;)
>>>
>>>Any ideas, please?
>>>
>>>Regards,


-- 
Guguin Benoit
Société Alixen 2 rue Jean Rostand 91 893 Orsay Cedex France
Tel : 01 69 85 24 13, Fax : 01 69 85 24 10
Received on Fri Aug 19 05:58:24 2005