Re: [swish-e] exclude line of text from indexing

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Fri Oct 09 2009 - 20:14:23 GMT
Rob Lingelbach wrote on 10/09/2009 02:23 PM:
> On Oct 9, 2009, at 4:16 PM, Peter Karman wrote:
> 
>> Rob Lingelbach wrote on 10/09/2009 01:54 PM:
>>> I need to exclude from swish-e indexing lines such as:
>>>
>>> "Next message: <some text>"
>>>
>>> and
>>>
>>> "Previous message: <some text>"
>>>
>> http://swish-e.org/devel/devel_docs/swish-config.html#obeyrobotsnoindex
> 
> Thanks for the answer Peter, but in this case perhaps I wasn't clear:

I think the docs aren't clear, but I understood you perfectly.

If you wrap your text like:

  <!-- noindex -->
  Next message: ......
  <!-- index -->

then swish-e will ignore the parts of the file you want ignored.

You could either (a) modify all your html with a batch process or (b) 
modify DirTree.pl and use it plus swish-e -S prog to filter your html as 
it is crawled.
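
For option (a), here is a minimal sketch of such a batch process. The script name (wrap-nav.pl), the helper name wrap_nav_lines, and the .bak backup suffix are illustrative assumptions, not part of swish-e. It assumes each "Next message:" or "Previous message:" entry sits on a single line, and it skips entries that are already wrapped, so it should be safe to run more than once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Wrap each "Next message: ..." / "Previous message: ..." run, up to the
# end of its line, in noindex/index comments. The lookbehind skips text
# that already follows a noindex comment, so a second run changes nothing.
# (wrap_nav_lines is a hypothetical helper name.)
sub wrap_nav_lines {
    my ($html) = @_;
    $html =~ s{(?<!<!-- noindex -->)((?:Next|Previous) message:[^\n]*)}
              {<!-- noindex -->$1<!-- index -->}gi;
    return $html;
}

# Rewrite every file named on the command line in place,
# keeping the original as FILE.bak.
for my $file (@ARGV) {
    open my $in, '<', $file or die "cannot read $file: $!";
    my $html = do { local $/; <$in> };    # slurp the whole file
    close $in;

    rename $file, "$file.bak" or die "cannot back up $file: $!";

    open my $out, '>', $file or die "cannot write $file: $!";
    print {$out} wrap_nav_lines($html);
    close $out or die "cannot close $file: $!";
}
```

Run it as something like "perl wrap-nav.pl archive/*.html" on a copy of your archive first. As with any regex over HTML, check a few files by hand to make sure the comments did not land inside a tag.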

Similar to DirTree.pl, SWISH::Prog makes this easy too. Just define a 
regex or other filter to add the noindex/index comments to the $doc 
object's content.

#!/usr/bin/perl
use strict;
use SWISH::Prog;

my $program = SWISH::Prog->new(
     aggregator => 'fs',
     filter     => \&myfilter,
);

$program->run(@ARGV);

sub myfilter {
     my $doc = shift;

     # if your html is more complicated you might use a real
     # parser here like HTML::Parser etc to make sure you do
     # not break the DOM.
     # wrap both kinds of navigation line, and put the newline back
     $doc->{content} =~
       s,((?:Next|Previous) message:.+?)\n,<!-- noindex -->$1<!-- index -->\n,sgi;
     return $doc;
}


-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Fri Oct 9 16:14:31 2009