Re: Filtering MS Word Documents

From: Sebastian Jayaraj <jayaraj(at)not-real.kosan.com>
Date: Tue Oct 18 2005 - 19:28:53 GMT

Hi Bill,

Here's the output of swish-filter-test with -v, as you suggested. I'm still
clueless: apart from ID3toHTML, all the filters load, yet none of them
filters the document.
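
For anyone reading the "Find path of [...]" lines in the output below: they
look like a plain directory-by-directory search of PATH. The following is
only a rough Python sketch of that kind of lookup, not the actual
SWISH::Filter code; the function name find_program is made up for
illustration.

```python
import os


def find_program(name, path=None):
    """Search each PATH directory in order for an executable called `name`.

    Returns (found_path_or_None, list_of_candidates_that_were_rejected).
    Hypothetical helper for illustration; not part of swish-e.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    tried = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        # Accept the first regular file that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate, tried
        tried.append(candidate)
    return None, tried


found, tried = find_program("catdoc")
for candidate in tried:
    print("Not found at path [%s]" % candidate)
if found:
    print("Found program at: [%s]" % found)
```

Because the check depends on the PATH and permissions of whoever runs it,
root and a normal user can get different results from the same search.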

------------------------------------------

[jayaraj@tnt jayaraj]$ swish-filter-test -v test.doc
SWISH::Filter found at [/usr/local/lib/swish-e/perl/SWISH/Filter.pm]


 >> Loading filter: [SWISH/Filters/Pdf2HTML.pm]
Find path of [pdftotext] in 
/usr/kerberos/bin:/bin:/usr/bin:/usr/local/bin:/usr/bin/X11:/usr/X11R6/bin:/home/jayaraj/bin:/usr/local/lib/swish-e
  Not found at path [/usr/kerberos/bin/pdftotext]
  Not found at path [/bin/pdftotext]
 * Found program at: [/usr/bin/pdftotext]

Find path of [pdfinfo] in 
/usr/kerberos/bin:/bin:/usr/bin:/usr/local/bin:/usr/bin/X11:/usr/X11R6/bin:/home/jayaraj/bin:/usr/local/lib/swish-e
  Not found at path [/usr/kerberos/bin/pdfinfo]
  Not found at path [/bin/pdfinfo]
 * Found program at: [/usr/bin/pdfinfo]


 >> Loading filter: [SWISH/Filters/ID3toHTML.pm]
trying to load [MP3::Tag]
Can not use Filter SWISH::Filters::ID3toHTML -- need to install 
MP3::Tag: No such file or directory

:-( Filter [SWISH/Filters/ID3toHTML.pm] not loaded


 >> Loading filter: [SWISH/Filters/XLtoHTML.pm]
trying to load [Spreadsheet::ParseExcel]
 ** Loaded Spreadsheet::ParseExcel **
trying to load [HTML::Entities]
 ** Loaded HTML::Entities **

 >> Loading filter: [SWISH/Filters/Doc2txt.pm]
Find path of [catdoc] in 
/usr/kerberos/bin:/bin:/usr/bin:/usr/local/bin:/usr/bin/X11:/usr/X11R6/bin:/home/jayaraj/bin:/usr/local/lib/swish-e
  Not found at path [/usr/kerberos/bin/catdoc]
  Not found at path [/bin/catdoc]
  Not found at path [/usr/bin/catdoc]
 * Found program at: [/usr/local/bin/catdoc]


 >> Starting to process new document: application/x-msword
 ++Checking filter [SWISH::Filters::Pdf2HTML=HASH(0x805f83c)] for 
application/x-msword
 ++ application/x-msword was not filtered by 
SWISH::Filters::Pdf2HTML=HASH(0x805f83c)

 ++Checking filter [SWISH::Filters::XLtoHTML=HASH(0x83493c4)] for 
application/x-msword
 ++ application/x-msword was not filtered by 
SWISH::Filters::XLtoHTML=HASH(0x83493c4)

 ++Checking filter [SWISH::Filters::Doc2txt=HASH(0x835f794)] for 
application/x-msword
 ++ application/x-msword was not filtered by 
SWISH::Filters::Doc2txt=HASH(0x835f794)


Final Content type for test.doc is application/x-msword
  *No filters were used

Document test.doc was not filtered.
   Document:     test.doc  (test.doc)
   Content-Type: application/x-msword
   Parser type:

** /usr/local/bin/swish-filter-test:
  Skipping binary [test.doc]

------------------------------------------------------------------
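
One guess on my part: the output above shows Doc2txt being loaded and then
checked against application/x-msword, and declining it. If a filter decides
by matching the content type against a list of patterns, a pattern written
for "application/msword" would not match the "x-" variant. I have not
checked which patterns Doc2txt.pm actually declares, so the pattern lists
below are assumptions. A small Python sketch of the idea:

```python
import re

# ASSUMPTION: a pattern list a Word filter might declare. The real
# Doc2txt.pm patterns are not shown in the output above.
ASSUMED_PATTERNS = [re.compile(r"application/msword")]

# A widened list that also accepts the "x-" prefixed type.
WIDENED_PATTERNS = [re.compile(r"application/(x-)?msword")]


def filter_accepts(content_type, patterns):
    """Return True if any pattern matches somewhere in the content type."""
    return any(p.search(content_type) for p in patterns)


for label, patterns in (("assumed", ASSUMED_PATTERNS),
                        ("widened", WIDENED_PATTERNS)):
    for ctype in ("application/msword", "application/x-msword"):
        print(label, ctype, filter_accepts(ctype, patterns))
```

If that is what is happening here, the fix would be on the type-detection
or pattern side rather than in catdoc itself.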


Bill Moseley wrote:

>On Fri, Oct 14, 2005 at 02:54:50PM -0700, Sebastian Jayaraj wrote:
>  
>
>>Hello All,
>>
>> I have been using swish-e for a while, and it works beautifully for 
>>indexing PDF and Excel files. When I tried to index MS Word files, only 
>>the filenames were indexed. So I ran a simple swish-filter-test 
>>and found this:
>>
>>-------------------------------------------------
>>[root@tnt filters]# catdoc -V
>>Catdoc Version 0.93.3
>>[root@tnt filters]# swish-e -V
>>SWISH-E 2.4.2
>>[root@tnt filters]# swish-filter-test test.doc
>>
>>Document test.doc was not filtered.
>>   Document:     test.doc  (test.doc)
>>   Content-Type: application/x-msword
>>   Parser type:
>>
>>** /usr/local/bin/swish-filter-test:
>>  Skipping binary [test.doc]
>>------------------------------------------------
>>
>>catdoc works fine on its own and is on the PATH. Any pointers or 
>>suggestions would be helpful.
>>    
>>
>
>One suggestion would be to try the above with the -v option.
>And maybe run as a normal user instead of root.
>
>  
>
Received on Tue Oct 18 12:29:23 2005