Re: Parsing a hypermail archive to exclude headers and footers from swishdocument

From: Kissman, Paul (BLC) <Paul.Kissman(at)not-real.state.ma.us>
Date: Tue Oct 14 2003 - 20:02:25 GMT
Just wanted to thank Bill Moseley and David Norris for guiding me to the
correct approach to my hypermail indexing problem.

I am now parsing my hypermail messages using prog instead of fs.  I
modified the index_hypermail.pl script since I had already spent a fair
amount of time on the search script and had made a bunch of changes to
the template file for displaying search results, including the
DateRanges.pm module. So the out-of-the-box scripts wouldn't work for me
without undoing a lot of work.

One thing that stumped me for several days after the changeover to prog
was that I could get all my metadata and properties to be searched and
to display properly, but I kept losing the swishdescription; it wasn't
being saved in the index.  It turns out that I had a leftover IndexOnly
statement in my conf file. 

IndexOnly .html .shtml

That was killing the swishdescription.

Swish-E had thrown errors for the other conf directives that were only
appropriate for the File Access indexing method, but it didn't inform me
about this one.

Thanks again.

pjk

Paul J. Kissman
Library Information Systems Specialist
Massachusetts Board of Library Commissioners
648 Beacon St.
Boston, MA  02215
paul.kissman@state.ma.us
www.mlin.lib.ma.us or www.mlin.org
617-267-9400 / 800-952-7403 (in-state)
Fax: 617-421-9833


-----Original Message-----
From: moseley@hank.org [mailto:moseley@hank.org] 
Sent: Thursday, October 09, 2003 3:07 PM
To: Multiple recipients of list
Subject: [SWISH-E] Re: Parsing a hypermail archive to exclude headers
and footers from swishdocument

On Thu, Oct 09, 2003 at 11:30:55AM -0700, Kissman, Paul (BLC) wrote:
> I have a newbie question.
> 
> I have started to create hypermail archives of our majordomo lists in
> order to be able to search them via Swish-E.  (swish-e 2.2.3)
> 
> The only thing that still makes me unhappy about the way I have the
> Swish-E index generated is that it grabs the header and footer html from
> the hypermail message, actually everything that falls within the <body>
> tag. So, for instance if I am searching for my name "Paul Kissman", the
> search brings back results where the only mention of my name is in the
> footer pointing to the next or previous message, but not in the current
> message.

Are you using the index_hypermail.pl script that comes with swish?

It does:

        last if /<!-- body="end" -->/ || /^-- $/ || /^--$/ || /^(_|-){40,}\s*$/;

Which, on the swish-e archives, leaves off that data.
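
For readers who don't use Perl, the stop condition quoted above can be sketched in Python roughly as follows. This is only an illustration of the same test, not the index_hypermail.pl script itself; the names END_OF_BODY and body_lines are made up for the example.

```python
import re

# Python rendering of the Perl stop condition: the hypermail end-of-body
# comment, a "-- " or "--" signature delimiter, or a rule of 40 or more
# underscores/dashes on a line of its own.
END_OF_BODY = re.compile(r'<!-- body="end" -->|^-- $|^--$|^(_|-){40,}\s*$')


def body_lines(lines):
    """Return the lines before the first end-of-body marker (hypothetical helper)."""
    kept = []
    for line in lines:
        # Drop only the newline so the trailing space in "-- " still counts.
        if END_OF_BODY.search(line.rstrip("\n")):
            break
        kept.append(line)
    return kept
```

Everything after the first matching line is dropped, so signatures and the next/previous navigation links never reach the indexer.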

> The hypermail conversion assigns the following tag to the part of my
> email messages that I want to index as the swishdescription
> 
> <div class="mail">
> 	Body of message goes here.
> </div>

You are likely using a newer version of hypermail (the one on Sunsite
where swish-e is hosted was written around 1950 I think.)

> I can't figure out if there is a way to have swish-e just index this
> part of the document or not.

Use -S prog and parse the documents.  That's exactly what it's for.  If 
your version of hypermail makes good use of <div> tags then something 
like HTML::TreeBuilder can make it easy to pull out the data you need.
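
A rough sketch of that approach, written in Python with only the standard library rather than Perl and HTML::TreeBuilder. It assumes the message body sits in the first <div class="mail"> element, as in the markup quoted above, and it emits only the Path-Name and Content-Length headers that the prog input format expects before each document. MailDivExtractor, extract_mail_body and swish_record are hypothetical names for this example.

```python
from html.parser import HTMLParser


class MailDivExtractor(HTMLParser):
    """Collect the text inside the first <div class="mail"> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # 0 = outside the target div; counts nested divs
        self.done = False   # set once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth:
            self.depth += 1
        elif "mail" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_mail_body(html):
    """Return the message text, without header and footer navigation."""
    parser = MailDivExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


def swish_record(path, body):
    """Format one document for -S prog: headers, blank line, then content."""
    payload = body.encode("utf-8")
    # Content-Length must be the byte length of the content that follows.
    header = f"Path-Name: {path}\nContent-Length: {len(payload)}\n\n"
    return header.encode("utf-8") + payload
```

A driver program would loop over the archive's message files and write each record to standard output in binary mode; swish-e then reads those records when run with -S prog.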

I also wonder if quoted text should be indexed when indexing a mail 
archive.

I find hypermail/pipermail odd in that they generate HTML output.  Seems
like for an archive you should just archive the original data (with 
attachments stripped perhaps) and then generate the HTML when viewing.  
Store the thread data in a separate file.  I suppose disk space is 
inexpensive, though.

-- 
Bill Moseley
moseley@hank.org
Received on Tue Oct 14 20:02:29 2003