Re: Parsing a hypermail archive to exclude headers and footers from swishdocument

From: <moseley(at)not-real.hank.org>
Date: Thu Oct 09 2003 - 19:06:39 GMT
On Thu, Oct 09, 2003 at 11:30:55AM -0700, Kissman, Paul (BLC) wrote:
> I have a newbie question.
> 
> I have started to create hypermail archives of our majordomo lists in
> order to be able to search them via Swish-E.  (swish-e 2.2.3)
> 
> The only thing that still makes me unhappy about the way I have the
> Swish-E index generated is that it grabs the header and footer html from
> the hypermail message, actually everything that falls within the <body>
> tag. So, for instance if I am searching for my name "Paul Kissman", the
> search brings back results where the only mention of my name is in the
> footer pointing to the next or previous message, but not in the current
> message.

Are you using the index_hypermail.pl script that comes with swish?

It does:

        last if /<!-- body="end" -->/ || /^-- $/ || /^--$/ || /^(_|-){40,}\s*$/;

On the swish-e archives, that check leaves that data out of the index.
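For anyone who doesn't read Perl, here is a rough Python sketch of the same cut-off test. It is not code from index_hypermail.pl; the names END_OF_BODY and body_lines are made up for illustration, and it assumes the message arrives as a list of lines.

```python
import re

# Markers that end the indexable body, mirroring the Perl test quoted above:
# the hypermail end-of-body comment, a "-- " or "--" signature line,
# or a line made of 40 or more underscores/dashes.
END_OF_BODY = re.compile(r'<!-- body="end" -->|^-- $|^--$|^(_|-){40,}\s*$')


def body_lines(lines):
    """Return the lines that come before the first end-of-body marker."""
    kept = []
    for line in lines:
        # Drop the newline so "^-- $" sees exactly the signature delimiter.
        if END_OF_BODY.search(line.rstrip("\n")):
            break
        kept.append(line)
    return kept
```

Everything after the first matching line (signature, footer links, next/previous navigation) is simply never handed to the indexer.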

> The hypermail conversion assigns the following tag to the part of my
> email messages that I want to index as the swishdescription
> 
> <div class="mail">
> 	Body of message goes here.
> </div>

You are likely using a newer version of hypermail (the one on Sunsite
where swish-e is hosted was written around 1950 I think.)

> I can't figure out if there is a way to have swish-e just index this
> part of the document or not.

Use -S prog and parse the documents.  That's exactly what it's for.  If 
your version of hypermail makes good use of <div> tags then something 
like HTML::TreeBuilder can make it easy to pull out the data you need.
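
As a minimal sketch of that idea, here is a Python analogue using only the standard library (html.parser instead of HTML::TreeBuilder). It assumes the message body sits in a single <div class="mail">, as in the quoted example, and that the -S prog input is a Path-Name and Content-Length header, a blank line, then the content; check the swish-e documentation for the exact header set your version wants. The names MailDivExtractor, extract_mail_body and emit are hypothetical, as is passing the archive directory as the first argument.

```python
import sys
from html.parser import HTMLParser
from pathlib import Path


class MailDivExtractor(HTMLParser):
    """Collect the text found inside <div class="mail"> ... </div>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # div nesting depth inside the mail div (0 = outside)
        self.chunks = []  # text fragments seen while inside the mail div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested div inside the message body
        elif "mail" in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entering the message body

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_mail_body(html):
    """Return only the text of the mail div, without page header or footer."""
    parser = MailDivExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


def emit(path, text, out=None):
    """Write one document as headers, a blank line, then the body bytes."""
    out = out if out is not None else sys.stdout.buffer
    body = text.encode("utf-8")
    header = f"Path-Name: {path}\nContent-Length: {len(body)}\n\n"
    out.write(header.encode("utf-8") + body)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumption: the archive directory is given as the first argument.
    for page in sorted(Path(sys.argv[1]).rglob("*.html")):
        emit(page, extract_mail_body(page.read_text(errors="replace")))
```

Since only the text inside the mail div is emitted, a name that appears only in the next/previous links never reaches the index. Content-Length is counted in bytes of the encoded body, not characters.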

I also wonder if quoted text should be indexed when indexing a mail 
archive.

I find hypermail/pipermail odd in that they generate HTML output.  Seems 
like for an archive you should just archive the original data (with 
attachments stripped perhaps) and then generate the HTML when viewing.  
Store the thread data in a separate file.  I suppose disk space is 
inexpensive, though.

-- 
Bill Moseley
moseley@hank.org
Received on Thu Oct 9 19:12:13 2003