Dr Michael Daly wrote on 3/16/12 9:03 AM:
> I am invoking indexing via
> swish-e -S prog -c /share/MD0_DATA/swish-e-files/swish-e-conf/web_2.conf
> ********************************************************************************************************************************************************
> web_2.conf contents:
> IndexDir spider.pl
> SwishProgParameters /share/MD0_DATA/swish-e-files/swish-e-conf/spider.config
>
> IndexOnly .htm .html .txt .doc .pdf .xls
>
> IndexContents TXT* .txt .xls
> # Otherwise, use the HTML parser
> DefaultContents HTML*
> # I have only added the FileFilter options today ie Friday, ie to web_2.conf
> FileFilter .pdf pdftotext "'%p' -"
> FileFilter .doc catdoc "-s8859-1 -d8859-1 %p"
> FileFilter .xls xls2csv "-s8859-1 -d8859-1 %p"
You probably don't want to use FileFilter with the spider.pl script; let the
spider filter the content itself instead. See
http://swish-e.org/docs/spider.html#filter_content
Is it possible you neglected to paste part of your spider.config below?
According to the example in the docs, you seem to be missing this line, which
defines the $filter_sub and $response_sub that your config refers to:
my ($filter_sub, $response_sub ) = swish_filter();
> spider.config contents:
> @servers = (
> {
> base_url => 'http://localhost:104/_docs/test3/',
> #base_url => 'http://localhost:104/_docs/test3/Reception-duties.doc',
> email => 'swish@user.failed.to.set.email.invalid',
> link_tags => [qw/ a frame /],
> keep_alive => 1,
> test_url => sub { $_[0]->path !~ /\.(?:gif|jpeg|png)$/i },
> test_response => $response_sub,
> use_head_requests => 1, # Due to the response sub
> filter_content => $filter_sub,
> debug => 'errors, failed, headers, info, links, redirect, skipped, url',
>
> } );
>
To skip .zip and other files, modify the test_url regex above to something
like:
test_url => sub { $_[0]->path !~ /\.(?:gif|jpeg|png|zip)$/i },
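
Putting both suggestions together, a spider.config along these lines might be
a starting point. This is only a sketch: it assumes swish_filter() is
available to the config file as in the docs example and that SWISH::Filter is
installed. The base_url and the other settings are copied from your config,
and the trailing "1;" is my assumption of a harmless true return value.

```perl
# Sketch of a spider.config, not a tested configuration.
# swish_filter() is the call from the spider.pl docs example; it has to run
# before @servers because the hash below uses both subs it returns.
my ($filter_sub, $response_sub ) = swish_filter();

@servers = (
    {
        base_url          => 'http://localhost:104/_docs/test3/',
        email             => 'swish@user.failed.to.set.email.invalid',
        link_tags         => [qw/ a frame /],
        keep_alive        => 1,

        # Skip images and .zip archives by file extension.
        test_url          => sub { $_[0]->path !~ /\.(?:gif|jpeg|png|zip)$/i },

        test_response     => $response_sub,
        use_head_requests => 1,    # Due to the response sub
        filter_content    => $filter_sub,
        debug             => 'errors, failed, headers, info, links, redirect, skipped, url',
    },
);

1;    # assumed: end the config file with a true value
```

With the spider doing the filtering, the FileFilter lines in web_2.conf should
no longer be needed.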
--
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Sat Mar 17 2012 - 01:40:42 GMT