On Fri, Feb 11, 2005 at 10:52:14AM -0800, Shaffer, Chris wrote:
> Hi... I've gotten swish-e (using spider.pl) to crawl a couple of our
> intranet sites. The filters seem to be working okay for excel. And it
> seems to be looking at word documents. However, (using swish.cgi), I
> don't get any descriptions for those word docs.
..
> Any idea where I can look? I have no idea where to begin digging.
Sure. spider.pl just writes to stdout, so you can run it on a few
test docs and see what it outputs. Run it on a file that generates
a description and on another that doesn't, then compare the two.
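One way to make that comparison quick is to pull out only the header you care about. This is a minimal sketch, not part of spider.pl: the doc_type helper is a hypothetical name, and it assumes the spider output carries a Document-Type: header line ahead of the document body.

```shell
# Hypothetical helper: print the value of the first Document-Type
# header found in spider.pl output read from stdin.
doc_type() {
    awk -F': ' '/^Document-Type:/ { print $2; exit }'
}

# Usage sketch (the spider.pl path and both URLs are placeholders;
# substitute one document that gets a description and one that doesn't):
# for url in http://localhost/apache/good.html http://localhost/apache/test.doc; do
#     printf '%s -> ' "$url"
#     SPIDER_QUIET=1 /usr/local/lib/swish-e/spider.pl default "$url" | doc_type
# done
```

If the two documents report different types, that difference is the first thing to check against the config.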
> StoreDescription HTML* <body> 200000
Make sure in the spider.pl output that the document's Document-Type
header is indeed HTML*:
$ SPIDER_QUIET=1 /usr/local/lib/swish-e/spider.pl default http://localhost/apache/test.doc | head
Path-Name: http://localhost/apache/test.doc
Content-Length: 1713
Last-Mtime: 1108148269
Document-Type: TXT*
That's saying the document is being parsed as TXT*, so you would need
to add another StoreDescription line for TXT*.
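As a rough sketch, the config might end up with both lines. This assumes the TXT form takes only a size limit and no tag name; the 200000 value is simply copied from the HTML line, so check the StoreDescription entry in the Swish-e configuration docs before relying on it.

```
# Existing line: description taken from <body> for docs parsed as HTML*
StoreDescription HTML* <body> 200000
# Assumed addition: docs parsed as TXT* have no tag to name, only a size
StoreDescription TXT* 200000
```

The index has to be rebuilt after the change before swish.cgi will show the new descriptions.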
--
Bill Moseley
moseley@hank.org
Received on Fri Feb 11 11:03:25 2005