Re: Trouble filtering xls with spider.pl

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Oct 22 2003 - 17:06:17 GMT

On Wed, Oct 22, 2003 at 09:31:09AM -0700, Bruce Pettyjohn wrote:
> I need help solving an Excel filtering problem.  The Filter.pm works just
> fine with
> 
>          swish-filter-test -verbose ./test.xls
> 
> When using spider.pl and the standard SwishSpiderConfig.pl "filter_content" sub
> all excel files are bypassed while ".doc" files are filtered.

I'm not really following all of the output below.  Here are some tests you can try:

1) Make sure we have the right content-type:

HEAD http://localhost/apache/party.xls
Connection: close
Date: Wed, 22 Oct 2003 16:59:55 GMT
Accept-Ranges: bytes
ETag: "6a6536b-3e00-3f96b622"
Server: Apache/1.3.28 (Debian GNU/Linux) mod_perl/1.28
Content-Length: 15872
Content-Type: application/vnd.ms-excel
Last-Modified: Wed, 22 Oct 2003 16:53:54 GMT
Client-Date: Wed, 22 Oct 2003 16:59:55 GMT
Client-Peer: 127.0.0.1:80
Client-Response-Num: 1


2) Test with swish-filter-test program:

moseley(at)not-real.bumby:~$ swish-filter-test http://localhost/apache/party.xls

Document http://localhost/apache/party.xls was  filtered.
   Document:     http://localhost/apache/party.xls  (http://localhost/apache/party.xls)
   Content-Type: text/html
   Parser type:  HTML*

   >Filter used: SWISH::Filters::XLtoHTML=HASH(0x843f358) ( application/vnd.ms-excel -> text/html )


3) Now using spider.pl:

moseley(at)not-real.bumby:~$ /usr/local/lib/swish-e/spider.pl default http://localhost/apache/party.xls | head
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'

Summary for: http://localhost/apache/party.xls
Total Bytes: 1,869  (155.8/sec)
 Total Docs:     1  (0.1/sec)
Unique URLs:     1  (0.1/sec)
Path-Name: http://localhost/apache/party.xls
Content-Length: 1869
Last-Mtime: 1066841634
Document-Type: HTML*

<html>    
<head>
    <title>Sheet1 - /tmp/ZdBdpEZNH2 v.1536</title>
    <meta name="Filename" content="/tmp/ZdBdpEZNH2">
    <meta name="Version" content="1536">

So that's working here, with spider.pl reading its parameters from 'default'.

Does your SwishSpiderConfig.pl filter on content-type?  

Is your server returning a different content type?
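
One way to check that last point on your own server is to pull the Content-Type out of the HEAD output from test 1 and compare it with the type the XLtoHTML filter reported handling in test 2. This is only a rough sketch: content_type and is_excel_type are hypothetical helper names, not part of swish-e, and it assumes a POSIX shell with awk.

```shell
# Hypothetical helper (not part of swish-e): read HTTP response headers
# on stdin and print the media type, lowercased and without parameters.
content_type() {
    awk -F': *' 'tolower($1) == "content-type" {
        sub(/\r$/, "", $2)      # drop a trailing CR if present
        sub(/ *;.*$/, "", $2)   # drop parameters such as "; charset=..."
        print tolower($2)
        exit
    }'
}

# Hypothetical helper: succeed only for the type that the XLtoHTML
# filter reported in test 2 (application/vnd.ms-excel -> text/html).
is_excel_type() {
    [ "$1" = "application/vnd.ms-excel" ]
}

# Example use, assuming the LWP "HEAD" command from test 1 is installed:
#   type=$(HEAD http://localhost/apache/party.xls | content_type)
#   is_excel_type "$type" || echo "server sent: $type"
```

If your server prints something else for the .xls file (application/octet-stream, say), that would explain why the files are skipped under spider.pl even though swish-filter-test works on a local file.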

-- 
Bill Moseley
moseley@hank.org
Received on Wed Oct 22 17:06:36 2003