Trouble filtering xls with spider.pl

From: Bruce Pettyjohn <bruce.pettyjohn(at)not-real.varianinc.com>
Date: Wed Oct 22 2003 - 16:32:00 GMT
I need help solving an Excel filtering problem. Filter.pm works fine with

         swish-filter-test -verbose ./test.xls

When I use spider.pl with the standard "filter_content" sub from
SwishSpiderConfig.pl, all Excel files are bypassed, while ".doc" files are
filtered. The "XLtoHTML.pm" package does not seem to be reached from
spider.pl, but it is reached from the swish-filter-test program.

Any ideas on how I might solve the problem? Any guidance would be much
appreciated.

Thanks,
Bruce


Here is my environment:
=================
swish-e version:  swish-e-2.4.0-pr4
os:  Solaris 8


Here are the results:
===============
./spider.pl > x
./spider.pl: Reading parameters from '/usr/varian/search/eportal/kla/SwishSpiderConfig.pl'

 >> Loading filter: [SWISH/Filters/Doc2txt.pm]
Find path of [catdoc] in /usr/sbin:/usr/bin:/usr/local/lib/swish-e
   Not found at path [/usr/sbin/catdoc]
  * Found program at: [/usr/bin/catdoc]

 >> Loading filter: [SWISH/Filters/Pdf2HTML.pm]
Find path of [pdftotext] in /usr/sbin:/usr/bin:/usr/local/lib/swish-e
   Not found at path [/usr/sbin/pdftotext]
  * Found program at: [/usr/bin/pdftotext]

Find path of [pdfinfo] in /usr/sbin:/usr/bin:/usr/local/lib/swish-e
   Not found at path [/usr/sbin/pdfinfo]
  * Found program at: [/usr/bin/pdfinfo]

 >> Loading filter: [SWISH/Filters/ID3toHTML.pm]
trying to load [MP3::Tag]
Can not use Filter SWISH::Filters::ID3toHTML -- need to install MP3::Tag: No such file or directory

:-( Filter [SWISH/Filters/ID3toHTML.pm] not loaded

 >> Loading filter: [SWISH/Filters/XLtoHTML.pm]
trying to load [Spreadsheet::ParseExcel]
  ** Loaded Spreadsheet::ParseExcel **
trying to load [HTML::Entities]
  ** Loaded HTML::Entities **


spider.pl processing a Word doc, including added debug notes
============================================
Filter.pm:  328:  self->{filters} = ARRAY(0x9054fc) | attr = %attr
Filter.pm:  337:  content_type = application/msword
Filter.pm:  353: doc = SCALAR(0x31cfd8)
Filter.pm:  376:  doc_object = SWISH::Filter::document=HASH(0x53801c)
Filter.pm:  458:  self->{filters} = ARRAY(0x5b5bec)
Filter.pm:  465:  doc_object = SWISH::Filter::document=HASH(0x53801c)
SwishSpiderConfig.pl:  294:  was_filtered = [1] application/msword
Path-Name: http://www.varianinc.com/image/vimage/docs/products/vacuum/kla/shared/FAR_9699049S020_100025.doc
Content-Length: 458
Last-Mtime: 1066434913
Document-Type: TXT*


spider.pl processing an Excel doc, including added debug notes
============================================
Filter.pm:  328:  self->{filters} = ARRAY(0x3e1a14) | attr = %attr
Filter.pm:  337:  content_type = application/vnd.ms-excel
Filter.pm:  353: doc = SCALAR(0x31cfd8)
Filter.pm:  376:  doc_object = SWISH::Filter::document=HASH(0x3e1b1c)
Filter.pm:  458:  self->{filters} = ARRAY(0x3e1c24)
Filter.pm:  465:  doc_object = SWISH::Filter::document=HASH(0x3e1b1c)
SwishSpiderConfig.pl:  294:  was_filtered = [0] application/vnd.ms-excel
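
When more than a couple of documents are spidered, the two traces above can be condensed into a per-type tally. The following is only a sketch and not part of swish-e: `summarize_was_filtered` is a hypothetical helper name, and it assumes the hand-added debug lines have exactly the form shown above (`was_filtered = [0|1]` followed by the MIME type as the last field) and arrive on standard input.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with swish-e): tally the hand-added
# "was_filtered = [N] <mime-type>" debug lines per MIME type.
summarize_was_filtered() {
    awk '
        /was_filtered = \[[01]\]/ {
            # The last whitespace-separated field is the MIME type.
            if ($0 ~ /was_filtered = \[1\]/) filtered[$NF]++
            else                             skipped[$NF]++
            seen[$NF] = 1
        }
        END {
            # Unset counters print as 0 with %d.
            for (t in seen)
                printf "%s filtered=%d skipped=%d\n", t, filtered[t], skipped[t]
        }
    ' | LC_ALL=C sort
}
```

Assuming the debug lines are written to standard error, something like `./spider.pl 2>&1 >x | summarize_was_filtered` would feed them in. For the two traces above it would report `application/msword filtered=1 skipped=0` and `application/vnd.ms-excel filtered=0 skipped=1`, which makes it easy to see whether every document of a given content type is being skipped.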


swish-filter-test on local file - (works on http as well)
=====================================
swish-filter-test -verbose ./test.xls
SWISH::Filter found at [/usr/local/lib/swish-e/perl/SWISH/Filter.pm]

 >> Loading filter: [SWISH/Filters/Doc2txt.pm]
Find path of [catdoc] in /usr/sbin:/usr/bin:/usr/local/bin:/usr/ccs/bin:/usr/include:/usr/ucbinclude:/usr/local/lib/swish-e
   Not found at path [/usr/sbin/catdoc]
  * Found program at: [/usr/bin/catdoc]

 >> Loading filter: [SWISH/Filters/Pdf2HTML.pm]
Find path of [pdftotext] in /usr/sbin:/usr/bin:/usr/local/bin:/usr/ccs/bin:/usr/include:/usr/ucbinclude:/usr/local/lib/swish-e
   Not found at path [/usr/sbin/pdftotext]
  * Found program at: [/usr/bin/pdftotext]

Find path of [pdfinfo] in /usr/sbin:/usr/bin:/usr/local/bin:/usr/ccs/bin:/usr/include:/usr/ucbinclude:/usr/local/lib/swish-e
   Not found at path [/usr/sbin/pdfinfo]
  * Found program at: [/usr/bin/pdfinfo]

 >> Loading filter: [SWISH/Filters/ID3toHTML.pm]
trying to load [MP3::Tag]
Can not use Filter SWISH::Filters::ID3toHTML -- need to install MP3::Tag: No such file or directory

:-( Filter [SWISH/Filters/ID3toHTML.pm] not loaded

 >> Loading filter: [SWISH/Filters/XLtoHTML.pm]
trying to load [Spreadsheet::ParseExcel]
  ** Loaded Spreadsheet::ParseExcel **
trying to load [HTML::Entities]
  ** Loaded HTML::Entities **

Filter.pm:  328:  self->{filters} = ARRAY(0x79d984) | attr = %attr
Filter.pm:  337:  content_type =
Filter.pm:  353: doc = ./test.xls
Filter.pm:  376:  doc_object = SWISH::Filter::document=HASH(0x834490)
37: SWISH::Filter - filter->content_type = [SWISH::Filter::document=HASH(0x834490)->content_type]
45: SWISH::Filter - file = [./test.xls]
125:  SWISH::Filters::XLtoHTML <html>
<head>
     <title>Apr 03 - ./test.xls v.1536</title>
     <meta name="Filename" content="./test.xls">
     <meta name="Version" content="1536">



*********************************************************************
Due to deletion of content types excluded from this list by policy,
this multipart message was reduced to a single part, and from there
to a plain text message.
*********************************************************************
Received on Wed Oct 22 16:32:24 2003