Re: [swish-e] exclude line of text from indexing

From: Rob Lingelbach <rob(at)not-real.colorist.org>
Date: Fri Oct 09 2009 - 20:32:23 GMT
It turns out that every HTML file created by Mailman/Pipermail has the
message body wrapped in <pre>...body...</pre> tags, with all the other
HTML markup outside, so I'll figure something out. Since there are so
many mailing list archives on the web, a high percentage of them use
Mailman, and swish-e is one of only two search packages I found suitable
for my not uncommon purposes, I would have thought there would already
be a built-in option for the Mailman defaults. Yet here I am, having
worked with Mailman for many years, and I never took the time to notice
how the HTML was wrapped. I had been using htdig and then webglimpse
previously.

So I have three or four choices of how to do it. Thanks again.

(from David Norris:
swish-e -f foo.idx -w div.postbody=(search query) )

and from Peter K.:

On Oct 9, 2009, at 5:14 PM, Peter Karman wrote:

> I think the docs aren't clear, but I understood you perfectly.
>
> If you wrap your text like:
>
>  <!-- noindex -->
>  Next message: ......
>  <!-- index -->
>
> then swish-e will ignore the parts of the file you want ignored.
>
> You could either (a) modify all your html with a batch process or (b)
> modify DirTree.pl and use it plus swish-e -S prog to filter your html
> as it is crawled.
>
> Similar to DirTree.pl, SWISH::Prog makes this easy too. Just define a
> regex or other filter to add the noindex/index comments to the $doc
> object's content.
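
A rough sketch of the batch-process idea in option (a), combined with the
observation that Pipermail puts the message body inside <pre>...</pre>.
It marks everything outside the first <pre> through the last </pre> with
the noindex/index comments Peter describes. The function name
mark_body_only is made up for illustration, and the assumption that one
<pre> block holds the whole body should be checked against real archive
pages before running this on anything.

```python
import re

NOINDEX = "<!-- noindex -->"
INDEX = "<!-- index -->"


def mark_body_only(html: str) -> str:
    """Wrap the text outside the <pre> message body in noindex/index comments.

    Hypothetical helper: assumes the message body runs from the first
    <pre> tag to the last </pre> tag, as in Pipermail-generated pages.
    """
    start = re.search(r"<pre\b[^>]*>", html, flags=re.IGNORECASE)
    closes = list(re.finditer(r"</pre\s*>", html, flags=re.IGNORECASE))
    if start is None or not closes or closes[-1].start() < start.end():
        # No recognizable body block: return the page unchanged.
        return html
    end = closes[-1]
    head = html[:start.start()]           # markup before the body
    body = html[start.start():end.end()]  # <pre>...</pre> including tags
    tail = html[end.end():]               # navigation links, footer, etc.
    return f"{NOINDEX}{head}{INDEX}{body}{NOINDEX}{tail}{INDEX}"
```

Note that this also excludes the subject line in the page header from
indexing, which may or may not be what you want. The same function could
serve as the filter step in a -S prog script instead of rewriting the
files on disk.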

--
Rob Lingelbach
rob@colorist.org

Received on Fri Oct 9 16:32:58 2009