Rob Lingelbach wrote on 10/09/2009 02:23 PM:
> On Oct 9, 2009, at 4:16 PM, Peter Karman wrote:
>
>> Rob Lingelbach wrote on 10/09/2009 01:54 PM:
>>> I need to exclude from swish-e indexing lines such as:
>>>
>>> "Next message: <some text>"
>>>
>>> and
>>>
>>> "Previous message: <some text>"
>>>
>> http://swish-e.org/devel/devel_docs/swish-config.html#obeyrobotsnoindex
>
> Thanks for the answer Peter, but in this case perhaps I wasn't clear:

I think the docs aren't clear, but I understood you perfectly.

If you wrap your text like:

  <!-- noindex -->
  Next message: ......
  <!-- index -->

then swish-e will ignore the parts of the file you want ignored.

You could either (a) modify all your HTML with a batch process or (b)
modify DirTree.pl and use it plus swish-e -S prog to filter your HTML as
it is crawled.

Similar to DirTree.pl, SWISH::Prog makes this easy too. Just define a
regex or other filter to add the noindex/index comments to the $doc
object's content.

#!/usr/bin/perl
use strict;
use SWISH::Prog;

my $program = SWISH::Prog->new(
    aggregator => 'fs',
    filter     => \&myfilter,
);
$program->run(@ARGV);

sub myfilter {
    my $doc = shift;

    # if your html is more complicated you might use a real
    # parser here like HTML::Parser etc to make sure you do
    # not break the DOM.
    # wrap both "Next message:" and "Previous message:" lines and
    # keep the newline that ends each one.
    $doc->{content} =~
        s,((?:Next|Previous) message:.+?)\n,<!-- noindex -->$1<!-- index -->\n,sgi;

    return $doc;
}
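
For option (a), the batch process, here is a minimal sketch in plain Perl
with no SWISH::Prog dependency. The helper name wrap_noindex is made up
for illustration, and the script assumes each "Next message:" or
"Previous message:" entry sits on a single line of the HTML. It rewrites
the files named on the command line in place, so try it on a copy first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# wrap_noindex is a hypothetical helper, not part of swish-e.
# It wraps the rest of each "Next message:" / "Previous message:"
# line in the noindex/index comments. The lookbehind skips lines
# that are already wrapped, so a second run changes nothing.
sub wrap_noindex {
    my ($html) = @_;
    my $pattern = qr{(?<!<!-- noindex -->)((?:Next|Previous) message:[^\n]*)}i;
    $html =~ s/$pattern/<!-- noindex -->$1<!-- index -->/g;
    return $html;
}

# When run as a script, rewrite every file named on the command line.
unless (caller) {
    for my $file (@ARGV) {
        open my $in, '<', $file or die "cannot read $file: $!";
        my $html = do { local $/; <$in> };   # slurp the whole file
        close $in;

        open my $out, '>', $file or die "cannot write $file: $!";
        print {$out} wrap_noindex($html);
        close $out or die "cannot close $file: $!";
    }
}
```

Going by the ObeyRobotsNoIndex docs linked above, the comments should only
take effect when that directive is enabled in your swish-e config, so
check that setting before re-indexing.
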
--
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Fri Oct 9 16:14:31 2009