Just wanted to thank Bill Moseley and David Norris for guiding me to the
correct approach to my hypermail indexing problem.
I am now parsing my hypermail messages using -S prog instead of -S fs. I
modified the index_hypermail.pl script since I had already spent a fair
amount of time on the search script and had made a bunch of changes to
the template file for displaying search results, including the
DateRanges.pm module. So the out-of-the-box scripts wouldn't work for me
without undoing a lot of work.
One thing that stumped me for several days after the changeover to prog
was that I could get all my metadata and properties to be searched and
to display properly, but I kept losing the swishdescription; it wasn't
being saved in the index. It turns out that I had a leftover IndexOnly
statement in my conf file:

    IndexOnly .html .shtml

That was killing the swishdescription.
Swish-E had thrown errors for the other conf directives that were only
appropriate for the File Access indexing method, but it didn't inform me
about this one.
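
A minimal sketch of what the relevant part of such a conf might look like under -S prog. The script path and the conf file name are hypothetical; only the commented-out IndexOnly line comes from the setup described above.

```
# Hypothetical conf fragment for indexing with the prog method.
# Run with something like: swish-e -S prog -c hypermail.conf

# With -S prog, IndexDir names the program that feeds documents to swish-e
# (the path here is an assumption).
IndexDir ./index_hypermail.pl

# Leftover from the File Access (fs) setup. Swish-E did not complain about
# it, but with it in place the swishdescription was not saved in the index.
# IndexOnly .html .shtml
```

If other fs-era directives are still in the file, commenting them out one at a time and re-indexing is a simple way to find one that fails silently like this.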
Paul J. Kissman
Library Information Systems Specialist
Massachusetts Board of Library Commissioners
648 Beacon St.
Boston, MA 02215
www.mlin.lib.ma.us or www.mlin.org
617-267-9400 / 800-952-7403 (in-state)
From: firstname.lastname@example.org [mailto:email@example.com]
Sent: Thursday, October 09, 2003 3:07 PM
To: Multiple recipients of list
Subject: [SWISH-E] Re: Parsing a hypermail archive to exclude headers
and footers from swishdocument
On Thu, Oct 09, 2003 at 11:30:55AM -0700, Kissman, Paul (BLC) wrote:
> I have a newbie question.
> I have started to create hypermail archives of our majordomo lists in
> order to be able to search them via Swish-E. (swish-e 2.2.3)
> The only thing that still makes me unhappy about the way I have the
> Swish-E index generated is that it grabs the header and footer html of
> the hypermail message, actually everything that falls within the
> tag. So, for instance if I am searching for my name "Paul Kissman",
> the search brings back results where the only mention of my name is in the
> footer pointing to the next or previous message, but not in the
Are you using the index_hypermail.pl script that comes with swish?
    last if /<!-- body="end" -->/ || /^-- $/ || /^--$/ ||
Which, on the swish-e archives, leaves off that data.
> The hypermail conversion assigns the following tag to the part of my
> email messages that I want to index as the swishdescription
> <div class="mail">
> Body of message goes here.
You are likely using a newer version of hypermail (the one on Sunsite
where swish-e is hosted was written around 1950 I think.)
> I can't figure out if there is a way to have swish-e just index this
> part of the document or not.
Use -S prog and parse the documents. That's exactly what it's for. If
your version of hypermail makes good use of <div> tags then something
like HTML::TreeBuilder can make it easy to pull out the data you need.
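
As a rough illustration of that approach, here is a minimal sketch of a -S prog feeder. It is written in Python with the standard library's html.parser rather than Perl's HTML::TreeBuilder, so it is a swapped-in technique, not the index_hypermail.pl script. It assumes the message body sits in <div class="mail">, as in the question above. The names extract_mail_body and prog_record are made up for this example, and the Path-Name / Content-Length header layout should be checked against the documentation for your swish-e version.

```python
import html
import sys
from html.parser import HTMLParser


class MailBodyExtractor(HTMLParser):
    """Collect text found inside <div class="mail"> ... </div>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # > 0 while inside the mail div (counts nested divs)
        self.chunks = []  # text fragments seen inside the mail div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested div inside the mail body
        elif "mail" in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entering the mail body

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_mail_body(html_text):
    """Return the whitespace-normalized text of the mail div (hypothetical helper)."""
    parser = MailBodyExtractor()
    parser.feed(html_text)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def prog_record(path, body):
    """Build one document record: headers, a blank line, then the content.

    The body is wrapped in minimal HTML so it can be parsed as an HTML
    document; Content-Length is the size of the content in bytes.
    """
    content = ("<html><body>" + html.escape(body) + "</body></html>\n").encode("utf-8")
    header = f"Path-Name: {path}\nContent-Length: {len(content)}\n\n"
    return header.encode("utf-8") + content


if __name__ == "__main__":
    # Usage (assumption): pass the hypermail message files as arguments.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as fh:
            body = extract_mail_body(fh.read())
        sys.stdout.buffer.write(prog_record(path, body))
```

Because only the text inside the mail div is passed on, names that appear solely in the next/previous navigation links never reach the index. A real feeder would also want to emit the subject, author, and date as meta tags so they stay searchable.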
I also wonder if quoted text should be indexed when indexing a mail
I find hypermail/pipermail odd in that they generate HTML output. Seems
like for an archive you should just archive the original data (with
attachments stripped perhaps) and then generate the HTML when viewing.
Store the thread data in a separate file. I suppose disk space is
Received on Tue Oct 14 20:02:29 2003