I need help solving an Excel filtering problem. Filter.pm works just fine with:
swish-filter-test -verbose ./test.xls
When using spider.pl and the standard SwishSpiderConfig.pl "filter_content" sub,
all Excel files are bypassed while ".doc" files are filtered. The "XLtoHTML.pm"
package does not seem to be reached when running spider.pl, but it is reached
when running the swish-filter-test program.
Any ideas or guidance on how I might solve the problem would be much appreciated.
Thanks,
Bruce
Here is my environment:
=================
swish-e version: swish-e-2.4.0-pr4
os: Solaris 8
Here are the results:
===============
./spider.pl > x
./spider.pl: Reading parameters from '/usr/varian/search/eportal/kla/SwishSpiderConfig.pl'
>> Loading filter: [SWISH/Filters/Doc2txt.pm]
Find path of [catdoc] in /usr/sbin:/usr/bin:/usr/local/lib/swish-e
Not found at path [/usr/sbin/catdoc]
* Found program at: [/usr/bin/catdoc]
>> Loading filter: [SWISH/Filters/Pdf2HTML.pm]
Find path of [pdftotext] in /usr/sbin:/usr/bin:/usr/local/lib/swish-e
Not found at path [/usr/sbin/pdftotext]
* Found program at: [/usr/bin/pdftotext]
Find path of [pdfinfo] in /usr/sbin:/usr/bin:/usr/local/lib/swish-e
Not found at path [/usr/sbin/pdfinfo]
* Found program at: [/usr/bin/pdfinfo]
>> Loading filter: [SWISH/Filters/ID3toHTML.pm]
trying to load [MP3::Tag]
Can not use Filter SWISH::Filters::ID3toHTML -- need to install MP3::Tag: No such file or directory
:-( Filter [SWISH/Filters/ID3toHTML.pm] not loaded
>> Loading filter: [SWISH/Filters/XLtoHTML.pm]
trying to load [Spreadsheet::ParseExcel]
** Loaded Spreadsheet::ParseExcel **
trying to load [HTML::Entities]
** Loaded HTML::Entities **
spider.pl processing a Word doc, including added debug notes
============================================
Filter.pm: 328: self->{filters} = ARRAY(0x9054fc) | attr = %attr
Filter.pm: 337: content_type = application/msword
Filter.pm: 353: doc = SCALAR(0x31cfd8)
Filter.pm: 376: doc_object = SWISH::Filter::document=HASH(0x53801c)
Filter.pm: 458: self->{filters} = ARRAY(0x5b5bec)
Filter.pm: 465: doc_object = SWISH::Filter::document=HASH(0x53801c)
SwishSpiderConfig.pl: 294: was_filtered = [1] application/msword
Path-Name: http://www.varianinc.com/image/vimage/docs/products/vacuum/kla/shared/FAR_9699049S020_100025.doc
Content-Length: 458
Last-Mtime: 1066434913
Document-Type: TXT*
spider.pl processing an Excel doc, including added debug notes
============================================
Filter.pm: 328: self->{filters} = ARRAY(0x3e1a14) | attr = %attr
Filter.pm: 337: content_type = application/vnd.ms-excel
Filter.pm: 353: doc = SCALAR(0x31cfd8)
Filter.pm: 376: doc_object = SWISH::Filter::document=HASH(0x3e1b1c)
Filter.pm: 458: self->{filters} = ARRAY(0x3e1c24)
Filter.pm: 465: doc_object = SWISH::Filter::document=HASH(0x3e1b1c)
SwishSpiderConfig.pl: 294: was_filtered = [0] application/vnd.ms-excel
swish-filter-test on local file - (works on http as well)
=====================================
swish-filter-test -verbose ./test.xls
SWISH::Filter found at [/usr/local/lib/swish-e/perl/SWISH/Filter.pm]
>> Loading filter: [SWISH/Filters/Doc2txt.pm]
Find path of [catdoc] in /usr/sbin:/usr/bin:/usr/local/bin:/usr/ccs/bin:/usr/include:/usr/ucbinclude:/usr/local/lib/swish-e
Not found at path [/usr/sbin/catdoc]
* Found program at: [/usr/bin/catdoc]
>> Loading filter: [SWISH/Filters/Pdf2HTML.pm]
Find path of [pdftotext] in /usr/sbin:/usr/bin:/usr/local/bin:/usr/ccs/bin:/usr/include:/usr/ucbinclude:/usr/local/lib/swish-e
Not found at path [/usr/sbin/pdftotext]
* Found program at: [/usr/bin/pdftotext]
Find path of [pdfinfo] in /usr/sbin:/usr/bin:/usr/local/bin:/usr/ccs/bin:/usr/include:/usr/ucbinclude:/usr/local/lib/swish-e
Not found at path [/usr/sbin/pdfinfo]
* Found program at: [/usr/bin/pdfinfo]
>> Loading filter: [SWISH/Filters/ID3toHTML.pm]
trying to load [MP3::Tag]
Can not use Filter SWISH::Filters::ID3toHTML -- need to install MP3::Tag: No such file or directory
:-( Filter [SWISH/Filters/ID3toHTML.pm] not loaded
>> Loading filter: [SWISH/Filters/XLtoHTML.pm]
trying to load [Spreadsheet::ParseExcel]
** Loaded Spreadsheet::ParseExcel **
trying to load [HTML::Entities]
** Loaded HTML::Entities **
Filter.pm: 328: self->{filters} = ARRAY(0x79d984) | attr = %attr
Filter.pm: 337: content_type =
Filter.pm: 353: doc = ./test.xls
Filter.pm: 376: doc_object = SWISH::Filter::document=HASH(0x834490)
37: SWISH::Filter - filter->content_type = [SWISH::Filter::document=HASH(0x834490)->content_type]
45: SWISH::Filter - file = [./test.xls]
125: SWISH::Filters::XLtoHTML <html>
<head>
<title>Apr 03 - ./test.xls v.1536</title>
<meta name="Filename" content="./test.xls">
<meta name="Version" content="1536">
Received on Wed Oct 22 16:32:24 2003