On Wed, Oct 22, 2003 at 09:31:09AM -0700, Bruce Pettyjohn wrote:
> I need help solving an Excel filtering problem. The Filter.pm works just
> fine with
>
> swish-filter-test -verbose ./test.xls
>
> When using spider.pl and the standard SwishSpiderConfig.pl "filter_content" sub
> all excel files are bypassed while ".doc" files are filtered.
I'm not really following all the output below. Here are some tests you can try:
1) Make sure we have the right content-type:
HEAD http://localhost/apache/party.xls
Connection: close
Date: Wed, 22 Oct 2003 16:59:55 GMT
Accept-Ranges: bytes
ETag: "6a6536b-3e00-3f96b622"
Server: Apache/1.3.28 (Debian GNU/Linux) mod_perl/1.28
Content-Length: 15872
Content-Type: application/vnd.ms-excel
Last-Modified: Wed, 22 Oct 2003 16:53:54 GMT
Client-Date: Wed, 22 Oct 2003 16:59:55 GMT
Client-Peer: 127.0.0.1:80
Client-Response-Num: 1
2) Test with swish-filter-test program:
moseley(at)not-real.bumby:~$ swish-filter-test http://localhost/apache/party.xls
Document http://localhost/apache/party.xls was filtered.
Document: http://localhost/apache/party.xls (http://localhost/apache/party.xls)
Content-Type: text/html
Parser type: HTML*
>Filter used: SWISH::Filters::XLtoHTML=HASH(0x843f358) ( application/vnd.ms-excel -> text/html )
3) Now using spider.pl:
moseley(at)not-real.bumby:~$ /usr/local/lib/swish-e/spider.pl default http://localhost/apache/party.xls | head
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'
Summary for: http://localhost/apache/party.xls
Total Bytes: 1,869 (155.8/sec)
Total Docs: 1 (0.1/sec)
Unique URLs: 1 (0.1/sec)
Path-Name: http://localhost/apache/party.xls
Content-Length: 1869
Last-Mtime: 1066841634
Document-Type: HTML*
<html>
<head>
<title>Sheet1 - /tmp/ZdBdpEZNH2 v.1536</title>
<meta name="Filename" content="/tmp/ZdBdpEZNH2">
<meta name="Version" content="1536">
So that's working.
Does your SwishSpiderConfig.pl filter on content-type?
Is your server returning a different content type?
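
To answer that second question quickly, something like the following might help. It is only a sketch: content_type_of is a hypothetical helper name (not part of swish-e or LWP), and it assumes awk is available and that the headers arrive on stdin in the form HEAD prints them in test 1.

```shell
# Hypothetical helper: print the media type found in HTTP headers on stdin,
# lower-cased and without parameters such as "; charset=...".
content_type_of() {
    awk -F': *' 'tolower($1) == "content-type" {
        sub(/;.*/, "", $2)          # drop any parameters after the type
        sub(/[ \t\r]+$/, "", $2)    # drop trailing whitespace or a CR
        print tolower($2)
        exit                        # only the first Content-Type matters
    }'
}

# Usage against the server being spidered (URL taken from the tests above):
#   HEAD http://localhost/apache/party.xls | content_type_of
```

If that prints anything other than application/vnd.ms-excel for your .xls files (application/octet-stream or text/plain, say), then the Excel filter shown in test 2 would presumably not be picked, since it is keyed on application/vnd.ms-excel.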
--
Bill Moseley
moseley@hank.org
Received on Wed Oct 22 17:06:36 2003