I'm making slow but steady progress with swish-e. I have multiple configuration files (file1.conf, file2.conf) that create index.file1, index.file1.prop, and so on.
I then search those indexes with swish.cgi, and this all works fine. The contents of file1.conf and file2.conf follow.
file1.conf:
IndexDir /app/swish/lib/swish-e/spider.pl
SwishProgParameters default http://10.20.172.100/doc/redhat-config-bind-2.0.0/
IndexFile /var/www/index.file1
ParserWarnLevel 3
FileFilter .pdf pdf2html "'%p' -"
IndexOnly HTML* .htm .html .asp
IndexContents HTML* .htm .html .shtml .pdf
IndexContents TXT* .txt .log .text
IndexContents XML* .xml
DefaultContents HTML
file2.conf:
IndexDir /app/swish/lib/swish-e/spider.pl
SwishProgParameters default http://10.201.12.64
IndexFile /var/www/index.file2
Metanames swishtitle swishdocpath
StoreDescription TXT* 10000
StoreDescription HTML* <body> 10000
IndexContents HTML* .htm .html .asp
IndexContents TXT* .txt .log .text
IndexContents XML* .xml
My next step is to use swishspider.conf, run as 'swish-e -S prog -c swishspider.conf'. The contents follow:
# Path to configuration file
SwishProgParameters /var/www/config.pl
# Path to spider.pl
IndexDir /app/swish/lib/swish-e/spider.pl
#
IndexOnly HTML* .htm .html .asp
FileFilter .pdf pdf2html "'%p' -"
IndexContents HTML* .htm .html .shtml .pdf
#
IndexContents TXT* .txt .log .text
#
IndexContents XML* .xml
#
DefaultContents HTML
I then created config.pl. Its contents follow:
# use lib '/app/swish/prog-bin';
# use pdf2html;
# sub pdf {
# my ( $uri, $server, $response, $content_ref ) = @_;
# return 1 unless $response->content_type eq 'application/pdf';
# $server->{counts}{'PDF transformed'}++;
# $$content_ref = ${pdf2html( $content_ref, 'title' )};
# $$content_ref =~ tr/ / /s;
# return 1;
# }
my %serverA = (
base_url => 'http://10.201.12.64/',
email => 'allen.lung@ftb.ca.gov',
debug => DEBUG_URL | DEBUG_FAILED | DEBUG_SKIPPED,
# link_tags => [qw/ a frame /],
# test_url => \&foo,
);
my %serverB = (
base_url => 'http://10.20.172.100/doc/redhat-config-bind-2.0.0/',
email => 'allen.lung@ftb.ca.gov',
# link_tags => [qw/ a frame /],
# test_url => \&foo,
);
@servers = ( \%serverA, \%serverB, );
# test_url => sub {
# my $uri = shift;
# return 0 if $uri->path =~ /\.(gif|jpeg|png|doc|pdf)$/;
# return 1;
# },
Is this the proper way to use config.pl?
As written, it actually attempts to index .pdf and .doc files!
I do want to index .pdf, .doc and many other file types eventually, and .pdf is the first one I want to add beyond what I'm indexing now. I hope I'm making sense here. I started on this with the code that is commented out with #. Is that the proper place to put the callback subroutines?
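For reference, here is a minimal sketch of how the commented-out pieces might be wired in. It assumes that spider.pl picks up callbacks such as filter_content and test_url as keys inside each server hash (so a test_url entry sitting outside the hashes would have no effect), and that pdf2html.pm in /app/swish/prog-bin provides pdf2html() as it is called in the commented code above. The helper name skip_images is made up for illustration, and only one server is shown.

```perl
# config.pl: sketch only, not a tested configuration.
# Assumption: pdf2html.pm lives in this directory and exports pdf2html().
use lib '/app/swish/prog-bin';
use pdf2html;

# filter_content callback: replace a fetched PDF body with converted HTML.
sub pdf {
    my ( $uri, $server, $response, $content_ref ) = @_;

    # Leave anything that is not a PDF untouched.
    return 1 unless $response->content_type eq 'application/pdf';

    # Counter used only for reporting.
    $server->{counts}{'PDF transformed'}++;

    $$content_ref = ${ pdf2html( $content_ref, 'title' ) };
    return 1;    # true, so the spider keeps processing the document
}

# test_url callback (hypothetical helper name): a false return value
# skips the URL. Images are skipped; .pdf is deliberately allowed through.
sub skip_images {
    my ( $uri, $server ) = @_;
    return $uri->path !~ /\.(gif|jpe?g|png)$/i;
}

my %serverA = (
    base_url       => 'http://10.201.12.64/',
    email          => 'allen.lung@ftb.ca.gov',
    test_url       => \&skip_images,
    filter_content => \&pdf,
);

@servers = ( \%serverA );
```

The point of the sketch is the placement: the subroutines are defined at file level, and each server hash refers to them by reference under the filter_content and test_url keys.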
Received on Thu Apr 15 11:50:22 2004