I am spidering a site (the spider is called from the swish-e indexing run).
The site contains .exe and .zip files. I DO NOT want those files to be
indexed (or even downloaded). Here is my command line for swish indexing:
swish-e -S prog -c swish.config
How do I keep it from indexing .exe and .zip files? (My config files are
listed below.) I even have some entries in my robots.txt file that I thought
would keep the files from being spidered, but that isn't working either.
lance
--swish.config--
#--- swish.config:
#--- where is the spider proggie?
IndexDir /home/perry/Soft/swish/lib/swish-e/spider.pl
#--- configuration for the spider
SwishProgParameters ccenter.config
#--- swish index file
IndexFile info.index
PropertyNames title
#--- grab the body to store (to be searchable)
StoreDescription HTML2 <body> 20000
#--- index all these guys
IndexContents HTML2 .html .htm .php .pdf .doc .xls .ppt
#--- File Filter for pdf files
FileFilter .pdf /home/perry/WebTools/bin/pdftohtml "'%p' -stdout -q -noframes"
#--- File Filter for doc files
FileFilter .doc /home/perry/WebTools/bin/catdoc "-s8859-1 -d8859-1 '%p'"
#--- File Filter for xls files
FileFilter .xls /home/perry/WebTools/bin/xlhtml "'%p'"
#--- File Filter for ppt files
FileFilter .ppt /home/perry/WebTools/bin/ppthtml "'%p'"
--end of swish.config--
--ccenter.config--
my %ccenter = (
email => 'Lance.Perry@ourdomain.com',
base_url => 'http://our.domain.com/ccenter/',
delay_sec => '0',
max_depth => '1',
credentials => 'username:password'
);
@servers = ( \%ccenter );
--end of ccenter.config--
--robots.txt--
User-agent: *
Disallow: /downloads/cisco-vpn/*.exe$
User-agent: *
Disallow: /downloads/cisco-vpn/*.zip$
User-agent: *
Disallow: /downloads/cisco-vpn/*.tar$
User-agent: *
Disallow: /downloads/cisco-vpn/*.gz$
User-agent: *
Disallow: /downloads/cisco-vpn/*.dmg$
--end of robots.txt--
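For reference, here is a minimal sketch of one way the exclusion might be expressed in the spider config, assuming spider.pl's test_url callback. That callback is called with a URI object and the server hash before a URL is requested, and a false return value skips the URL. The extension list is an assumption taken from the robots.txt entries above, and the sketch has not been tested against this site. Note also that the original robots exclusion standard matches Disallow values as plain path prefixes, so the * and $ patterns above may simply not be understood by the spider.

```perl
# ccenter.config with a hypothetical test_url filter added
my %ccenter = (
    email       => 'Lance.Perry@ourdomain.com',
    base_url    => 'http://our.domain.com/ccenter/',
    delay_sec   => '0',
    max_depth   => '1',
    credentials => 'username:password',

    # Called before each URL is fetched; returning false skips the URL,
    # so matching files are neither downloaded nor indexed.
    # The extension list is an assumption; adjust as needed.
    test_url    => sub {
        my ( $uri, $server ) = @_;
        return $uri->path !~ /\.(?:exe|zip|tar|gz|dmg)$/i;
    },
);
@servers = ( \%ccenter );
```

Because the check runs on the URL path alone, it should apply to any directory under base_url, not only /downloads/cisco-vpn/.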
Received on Wed Jan 5 12:16:02 2005