Re: [swish-e] multiple Warnings: 'could not be encoded to charset 'ISO-8859-1'

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Fri, 16 Mar 2012 20:40:39 -0500
Dr Michael Daly wrote on 3/16/12 9:03 AM:
> I am invoking indexing via
> swish-e -S prog -c /share/MD0_DATA/swish-e-files/swish-e-conf/web_2.conf
> ********************************************************************************************************************************************************
> web_2.conf contents:
>  IndexDir spider.pl
>  SwishProgParameters /share/MD0_DATA/swish-e-files/swish-e-conf/spider.config
> 
>  IndexOnly .htm .html .txt .doc .pdf .xls
> 
>  IndexContents TXT* .txt .xls
>   # Otherwise, use the HTML parser
>   DefaultContents HTML*
> # I have only added the FileFilter options today ie Friday, ie to web_2.conf
>       	FileFilter .pdf pdftotext   "'%p' -"
> 	FileFilter .doc catdoc "-s8859-1 -d8859-1 %p"
> 	FileFilter .xls xls2csv "-s8859-1 -d8859-1 %p"

you probably don't want to use FileFilter with the spider.pl script; the
spider can filter documents itself through its filter_content option. See
http://swish-e.org/docs/spider.html#filter_content

is it possible you neglected to paste part of your spider.config below?
according to the example in the docs, you seem to be missing this line:

 my ($filter_sub, $response_sub ) = swish_filter();
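
as a rough sketch of where that line belongs (going by the example in the
spider docs, and assuming the rest of your file matches what you pasted), it
sits at the top of spider.config, before the @servers list. Without it,
$filter_sub and $response_sub would be undefined and nothing gets filtered:

```perl
# spider.config: a sketch built from the pasted settings, not a tested config.
# swish_filter() is supplied by spider.pl when it loads this file; it returns
# the two callbacks that the server entry below refers to.
my ($filter_sub, $response_sub) = swish_filter();

@servers = (
    {
        base_url          => 'http://localhost:104/_docs/test3/',
        email             => 'swish@user.failed.to.set.email.invalid',
        link_tags         => [qw/ a frame /],
        keep_alive        => 1,
        test_url          => sub { $_[0]->path !~ /\.(?:gif|jpeg|png)$/i },
        test_response     => $response_sub,
        use_head_requests => 1,    # Due to the response sub
        filter_content    => $filter_sub,
    },
);

1;    # by convention the config file ends with a true value
```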


> spider.config contents:
> @servers = (
>     {
> 	base_url    => 'http://localhost:104/_docs/test3/',
> 	#base_url    => 'http://localhost:104/_docs/test3/Reception-duties.doc',
>         email               => 'swish@user.failed.to.set.email.invalid',
>         link_tags           => [qw/ a frame /],
>         keep_alive          => 1,
>         test_url            => sub {  $_[0]->path !~ /\.(?:gif|jpeg|png)$/i },
>         test_response       => $response_sub,
>         use_head_requests   => 1,  # Due to the response sub
>         filter_content      => $filter_sub,
> 	debug	=> 'errors, failed, headers, info, links, redirect, skipped, url',
> 
>     } );
> 

to skip .zip and other files, you want to modify the test_url regex above to
something like:

 test_url => sub {  $_[0]->path !~ /\.(?:gif|jpeg|png|zip)$/i },
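
if you want to check the regex before re-running the spider, a small
standalone script along these lines should do. It assumes the URI module is
installed (the spider passes a URI object to test_url), and the three URLs are
made up for illustration:

```perl
use strict;
use warnings;
use URI;    # test_url receives a URI object as its first argument

# Same callback as suggested above: true means "fetch", false means "skip".
my $test_url = sub { $_[0]->path !~ /\.(?:gif|jpeg|png|zip)$/i };

# Hypothetical URLs, for illustration only.
for my $url (qw(
    http://localhost:104/_docs/test3/index.html
    http://localhost:104/_docs/test3/archive.ZIP
    http://localhost:104/_docs/test3/photo.jpg
)) {
    my $verdict = $test_url->( URI->new($url) ) ? 'fetch' : 'skip';
    print "$verdict  $url\n";
}
```

note that photo.jpg comes out as "fetch": the pattern lists jpeg but not jpg,
so add jpg (or write jpe?g) if you want those skipped too.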


-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Sat Mar 17 2012 - 01:40:42 GMT