It turns out that every HTML file
created by mailman/pipermail has the message body wrapped
in <pre>...body...</pre> tags, with all the other HTML markup outside,
so I'll figure something out. Since there are so many mailing
list archives on the web and a high percentage use mailman, and
swish-e is one of only two search packages I found suitable for
my not uncommon purposes, I would have thought there would
already be a built-in option for the mailman defaults. Yet here
I am, having worked with mailman for many years, and I never
took the time to notice how the HTML was wrapped. I had been
using htdig and then webglimpse previously.
So I have 3 or 4 choices of how to do it. Thanks again.
From David Norris:
swish-e -f foo.idx -w div.postbody=(search query)
And from Peter K.:
On Oct 9, 2009, at 5:14 PM, Peter Karman wrote:
> I think the docs aren't clear, but I understood you perfectly.
> If you wrap your text like:
> <!-- noindex -->
> Next message: ......
> <!-- index -->
> then swish-e will ignore the parts of the file you want ignored.
> You could either (a) modify all your html with a batch process or (b)
> modify DirTree.pl and use it plus swish-e -S prog to filter your
> html as it is crawled.
> Similar to DirTree.pl, SWISH::Prog makes this easy too. Just define a
> regex or other filter to add the noindex/index comments to the $doc
> object's content.
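
For option (a), the batch approach, here is a rough sketch of what the filter might look like. It is written in Python rather than Perl, so it is not the DirTree.pl or SWISH::Prog route Peter describes, just the same idea. It assumes each pipermail page has exactly one <pre>...</pre> block holding the message body, as noted above; the function name mark_pipermail_body is made up for illustration. Note that it also hides the page title and headers from the indexer, which may or may not be what you want.

```python
import re

NOINDEX = "<!-- noindex -->"
INDEX = "<!-- index -->"

# Assumption: the message body is the first <pre>...</pre> block on the page.
# Tag case is ignored; the match is non-greedy so it stops at the first </pre>.
PRE_BLOCK = re.compile(r"<pre\b[^>]*>.*?</pre>", re.IGNORECASE | re.DOTALL)


def mark_pipermail_body(html: str) -> str:
    """Wrap everything outside the <pre> body in noindex/index comments.

    Hypothetical helper: returns the page unchanged if no <pre> block
    is found, so pages that do not fit the assumption are left alone.
    """
    match = PRE_BLOCK.search(html)
    if match is None:
        return html

    before = html[:match.start()]   # headers, navigation links, etc.
    body = match.group(0)           # the message text to be indexed
    after = html[match.end():]      # "Next message" links, footer, etc.

    return (
        f"{NOINDEX}\n{before}\n{INDEX}\n"
        f"{body}\n"
        f"{NOINDEX}\n{after}\n{INDEX}\n"
    )
```

To use it as a batch process, read each archive file, pass its contents through mark_pipermail_body, and write the result back (or to a copy of the archive) before indexing.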
Received on Fri Oct 9 16:32:58 2009