Re: [swish-e] exclude line of text from indexing

From: Rob Lingelbach <rob(at)>
Date: Fri Oct 09 2009 - 20:32:23 GMT
It turns out that every html file
created by mailman/pipermail has the message body wrapped
in <pre>...body...</pre> tags with all the html markup outside,
so.... I'll figure something out.   Since there are so many mailing
list archives on the web and a high percentage use mailman, and
swish-e is one of only two search packages I found suitable for
my not uncommon purposes, I would have thought there would
already be a built-in option for the mailman defaults.  Yet here
I am, having worked with mailman for many years, and I never
took the time to notice how the html was wrapped.  I had been
using htdig and then webglimpse previously.

So I have 3 or 4 choices of how to do it. Thanks again.

(from David Norris:
swish-e -f foo.idx -w div.postbody=(search query) )

and from Peter K.:

On Oct 9, 2009, at 5:14 PM, Peter Karman wrote:

> I think the docs aren't clear, but I understood you perfectly.
> If you wrap your text like:
>  <!-- noindex -->
>  Next message: ......
>  <!-- index -->
> then swish-e will ignore the parts of the file you want ignored.
> You could either (a) modify all your html with a batch process or (b)
> modify and use it plus swish-e -S prog to filter your html as
> it is crawled.
> Similar to, SWISH::Prog makes this easy too. Just define a
> regex or other filter to add the noindex/index comments to the $doc
> object's content.
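
A minimal sketch of the kind of filter described above, written in Python rather than with SWISH::Prog. It assumes each archive page holds the message body in a single <pre>...</pre> block, as noted earlier in this message, and marks everything outside that block with the noindex/index comments. The function name wrap_outside_pre is hypothetical, and the sketch also excludes the page header (including any title) from indexing, which may or may not be what you want.

```python
import re

NOINDEX = "<!-- noindex -->"
INDEX = "<!-- index -->"


def wrap_outside_pre(html: str) -> str:
    """Wrap everything outside the first <pre>...</pre> block in
    noindex/index comments so that only the message body is indexed.

    Assumption: the page has one <pre> block holding the message body.
    """
    match = re.search(r"<pre\b[^>]*>.*?</pre>", html,
                      flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        # No message body found; return the page unchanged.
        return html
    head = html[:match.start()]
    body = match.group(0)
    tail = html[match.end():]
    return f"{NOINDEX}{head}{INDEX}{body}{NOINDEX}{tail}{INDEX}"
```

The same function could serve either option: run over the files on disk as a batch step, or called from whatever program feeds documents to swish-e -S prog.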

Rob Lingelbach

