I'm trying to fix a problem with indexing HTML entities.
Since libxml2 is installed, character entities are automatically
converted. I want to preserve the entities, so I thought I could
set ConvertHTMLEntities to no and use the internal HTML parser
instead of HTML2. But when I run swish-e it still reports
"Using HTML2 parser". Also, the descriptions are now missing.
Thanks for any help on this!
#################################
IndexFile /www/mysite
IndexDir spider.pl
SwishProgParameters /www/mysite.com/cgi-bin/mysite_english.spider.config
PropertyNames description
PropertyNamesMaxLength 1000 description
MetaNames description keywords swishdocpath swishtitle category
StoreDescription HTML <body> 200000
ConvertHTMLEntities no
DefaultContents HTML
IndexContents HTML .cfm .cfml .htm .html
ExtractPath category regex !^(http://)*[^/]*/([^/]+)/.*$!$2! #get 1st directory name (dir)
IgnoreMetaTags script style
FileFilter .pdf pdftotext "'%p' -"
IndexContents HTML* .pdf
ReplaceRules regex !^(.*\?)(swishlang=[^&]+&*)(.*)?!$1$3!
###############################
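For anyone who wants to check the two regex directives above outside of
swish-e, here is a minimal sketch using sed -E. The function names
extract_category and strip_swishlang are made up for illustration, and
sed's extended regular expressions are only assumed to behave like
swish-e's regex handling for these patterns. swish-e's "$2" and "$1$3"
become "\2" and "\1\3" in sed.

```shell
#!/bin/sh
# Pattern copied from the ExtractPath line: capture the first directory
# name after the host. -n plus the trailing "p" prints only on a match.
# (extract_category is a hypothetical helper name, not part of swish-e.)
extract_category() {
    printf '%s\n' "$1" | sed -nE 's!^(http://)*[^/]*/([^/]+)/.*$!\2!p'
}

# Pattern copied from the ReplaceRules line: drop the swishlang=...
# parameter (and any "&" right after it) from the query string.
# (strip_swishlang is a hypothetical helper name, not part of swish-e.)
strip_swishlang() {
    printf '%s\n' "$1" | sed -E 's!^(.*\?)(swishlang=[^&]+&*)(.*)?!\1\3!'
}

extract_category 'http://www.mysite.com/site_map/index.cfm'
strip_swishlang 'http://www.mysite.com/index.cfm?swishlang=english'
```

With the URLs from the indexing run, extract_category should print
site_map for http://www.mysite.com/site_map/index.cfm and nothing for
http://www.mysite.com/index.cfm, since that URL has no directory
component for the pattern to capture.
###############################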
my %serverA = (
    base_url    => 'http://www.mysite.com/index.cfm?swishlang=english',
    same_hosts  => [ qw/mysite.com/ ],
    email       => 'name@email.com',
    keep_alive  => 0,
    use_md5     => 1,
    max_files   => 5,
    use_cookies => 1,
);
@servers = ( \%serverA, );
###############################
swish-e -c /www/mysite.com/cgi-bin/mysite_english.cfg -S prog -e -v 3
Parsing config file '/www/mysite.com/cgi-bin/mysite_english.cfg'
Indexing Data Source: "External-Program"
Indexing "spider.pl"
External Program found: /usr/local/lib/swish-e/spider.pl
/usr/local/lib/swish-e/spider.pl: Reading parameters from '/www/mysite.com/cgi-bin/mysite_english.spider.config'
http://www.mysite.com/index.cfm?swishlang=english - Using HTML2 parser - (339 words)
http://www.mysite.com/index.cfm - Using HTML2 parser - (339 words)
http://www.mysite.com/landing.cfm - Using HTML2 parser - (264 words)
http://www.mysite.com/site_map/index.cfm - Using HTML2 parser - (103 words)
/usr/local/lib/swish-e/spider.pl: Max files Reached
Summary for: http://www.mysite.com/index.cfm?swishlang=english
Connection: Close: 6 (0.2/sec)
Duplicates: 175 (6.7/sec)
Off-site links: 10 (0.4/sec)
Total Bytes: 67,224 (2585.5/sec)
Total Docs: 5 (0.2/sec)
Unique URLs: 6 (0.2/sec)
http://www.mysite.com/about/index.cfm - Using HTML2 parser - (313 words)
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 572 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
572 unique words indexed.
6 properties sorted.
5 files indexed. 67,224 total bytes. 1,420 total words.
Elapsed time: 00:00:26 CPU time: 00:00:00
Indexing done!
Received on Fri Feb 3 08:22:19 2006