Re: [swish-e] exclude line of text from indexing

From: Peter Karman <peter(at)>
Date: Fri Oct 09 2009 - 20:14:23 GMT

Rob Lingelbach wrote on 10/09/2009 02:23 PM:
> On Oct 9, 2009, at 4:16 PM, Peter Karman wrote:
>> Rob Lingelbach wrote on 10/09/2009 01:54 PM:
>>> I need to exclude from swish-e indexing lines such as:
>>> "Next message: <some text>"
>>> and
>>> "Previous message: <some text>"
>> config.html#obeyrobotsnoindex
> Thanks for the answer Peter, but in this case perhaps I wasn't clear:

I think the docs aren't clear, but I understood you perfectly.

If you wrap your text like:

  <!-- noindex -->
  Next message: ......
  <!-- index -->

then swish-e will ignore the parts of the file you want ignored.

You could either (a) modify all your html with a batch process or (b) 
modify and use it plus swish-e -S prog to filter your html as 
it is crawled.
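
For option (a), a minimal sketch of such a batch process might look like the following. The helper name add_noindex_markers is hypothetical, and the sketch assumes each "Next message:" or "Previous message:" entry sits on a single line of the HTML file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: wrap everything from "Next message:" or
# "Previous message:" to the end of that line in the noindex/index
# comment pair.
sub add_noindex_markers {
    my ($html) = @_;
    $html =~ s{((?:Next|Previous) message:.*)$}{<!-- noindex -->$1<!-- index -->}mgi;
    return $html;
}

# Rewrite every file named on the command line in place.
for my $file (@ARGV) {
    open my $in, '<', $file or die "cannot read $file: $!";
    my $html = do { local $/; <$in> };    # slurp the whole file
    close $in;

    open my $out, '>', $file or die "cannot write $file: $!";
    print {$out} add_noindex_markers($html);
    close $out or die "cannot close $file: $!";
}
```

Pass the HTML files as arguments to the script. Running it twice over the same file would wrap the lines twice, so it is probably safest to work on a copy of the archive.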

Similarly, SWISH::Prog makes this easy too. Just define a
regex or other filter that adds the noindex/index comments to the $doc
object's content.

use strict;
use warnings;
use SWISH::Prog;

my $program = SWISH::Prog->new(
     aggregator => 'fs',
     filter     => \&myfilter,
);

# filter callback: wraps "Next message:" lines in noindex/index comments
sub myfilter {
     my $doc = shift;

     # if your html is more complicated you might use a real
     # parser here like HTML::Parser etc to make sure you do
     # not break the DOM.
     $doc->{content} =~
       s,(Next message:.+?)\n,<!-- noindex -->$1<!-- index -->,sgi;
     return $doc;
}

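To see what the substitution in the filter does without running an indexer, the same regex can be tried on an in-memory string. This is only a sketch: the plain hash ref below stands in for the $doc object (an assumption for illustration), and mark_noindex is a hypothetical name.

```perl
use strict;
use warnings;

# Same substitution as in the filter, applied to a plain hash ref that
# stands in for the $doc object (an assumption for illustration only).
sub mark_noindex {
    my $doc = shift;
    $doc->{content} =~
      s,(Next message:.+?)\n,<!-- noindex -->$1<!-- index -->,sgi;
    return $doc;
}

my $doc = { content => "<p>body text</p>\nNext message: Re: foo\n<p>more</p>\n" };
mark_noindex($doc);
print $doc->{content};
```

Note that the substitution consumes the newline after the matched line, so the closing comment ends up directly in front of whatever followed. That is harmless in HTML. The pattern only covers "Next message:", so "Previous message:" lines would need a second substitution or an alternation.
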
Peter Karman  .  .  peter(at)
Received on Fri Oct 9 16:14:31 2009