Hi... I've gotten swish-e (using spider.pl) to crawl a couple of our
intranet sites. The filters seem to be working okay for Excel, and it
seems to be looking at Word documents. However, when I search with
swish.cgi, I don't get any descriptions for those Word docs.
I'm calling swish-e with 'swish-e -c swish.config -S prog' on a Fedora
Core 2 box.
I've installed catdoc, the Excel Perl modules, and xpdf. swish-filter-test
seems to work fine.
Any idea where I can look? I have no idea where to begin digging. Here
are my config files, if that'll help:
Spider.config
--------------------
my ($filter_sub, $response_sub) = swish_filter();
my %main_site = (
    base_url       => 'http://.../',
    email          => 'chris.shaffer@bellsouth.com',
    debug          => 'errors, url, info',
    delay_sec      => 0,
    test_response  => $response_sub,
    filter_content => $filter_sub,
);
my %bstcad_site = (
    base_url       => 'http://..../',
    email          => 'chris.shaffer@bellsouth.com',
    debug          => 'errors, url, info',
    delay_sec      => 0,
    test_response  => $response_sub,
    filter_content => $filter_sub,
);
my %ecars_site = (
    base_url       => 'http://.../',
    email          => 'chris.shaffer@bellsouth.com',
    debug          => 'errors, url, info',
    delay_sec      => 0,
    test_response  => $response_sub,
    filter_content => $filter_sub,
);
@servers = (\%ecars_site, \%bstcad_site, \%main_site);
1;
---------------------
Swish.config
------------------------
# Use spider.pl as the external program:
IndexDir spider.pl
IndexOnly .html .htm .xml .doc .pdf .xls .ppt
DefaultContents HTML*
StoreDescription HTML* <body> 200000
# And pass the name of the spider config file to the spider:
SwishProgParameters spider.config
-------------------------
Chris Shaffer
Application Developer, BSTCAD/BSTProcess
BSTCAD Support Forums
<http://forums.ecars.bst.bls.com/viewforum.php?f=2>
chris.shaffer@bellsouth.com
(404) 927-1227
Received on Fri Feb 11 10:53:53 2005