Parsing a hypermail archive to exclude headers and footers from swishdocument

From: Kissman, Paul (BLC) <Paul.Kissman(at)not-real.state.ma.us>
Date: Thu Oct 09 2003 - 18:32:12 GMT
I have a newbie question.

I have started to create hypermail archives of our majordomo lists so
that they can be searched with Swish-E (version 2.2.3).

The only thing that still bothers me about the way the Swish-E index is
generated is that it grabs the header and footer HTML from each
hypermail message; in fact, it grabs everything that falls within the
<body> tag. So, for instance, if I search for my name, "Paul Kissman",
the search brings back results where the only mention of my name is in
the footer link to the next or previous message, not in the current
message itself.

The hypermail conversion wraps the part of my email messages that I
want to index as the swishdescription in the following tag:

<div class="mail">
	Body of message goes here.
</div>

I can't figure out whether there is a way to have swish-e index just
this part of the document.

PropertyNameAlias swishdescription <div class="mail"> doesn't work (not
surprisingly).

I suppose I could have hypermail paste in some arbitrary XML tag like
<mailbody> around the <div class="mail"> tags.

But since the documents coming out of hypermail are not really
well-formed XHTML, I didn't think I could use XML parsing.
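
One way around the well-formedness problem might be to pre-filter each
page before it is indexed. The following is only a minimal sketch, not
something tested against swish-e: it assumes Python's standard
html.parser module (which tolerates sloppy HTML) and assumes the message
body is always wrapped in <div class="mail">. The names
MailBodyExtractor and extract_mail_body, and the sample page with its
"head" and "foot" divs, are made up for illustration. How the filtered
text would be handed to swish-e is not shown.

```python
from html.parser import HTMLParser


class MailBodyExtractor(HTMLParser):
    """Collect the text found inside <div class="mail"> ... </div>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # 0 = outside the mail div; >0 = div nesting inside it
        self.chunks = []  # text fragments collected from the mail body

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            # A nested <div> inside the mail body: track it so that its
            # closing tag does not end the capture early.
            self.depth += 1
        elif "mail" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_mail_body(html_text):
    """Return only the text of the message body, without header or footer."""
    parser = MailBodyExtractor()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.chunks).strip()


# Hypothetical page; real hypermail output will look different.
sample = """<html><body>
<div class="head">From: Paul Kissman</div>
<div class="mail">
	Body of message goes here.
</div>
<div class="foot">Next message: Paul Kissman</div>
</body></html>"""

print(extract_mail_body(sample))
```

With the sample above, the name in the header and footer divs is
dropped and only the message body text is kept.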

Any suggestions?

Paul J. Kissman
Library Information Systems Specialist
Massachusetts Board of Library Commissioners
648 Beacon St.
Boston, MA  02215
paul.kissman@state.ma.us
www.mlin.lib.ma.us or www.mlin.org
617-267-9400 / 800-952-7403 (in-state)
Fax: 617-421-9833
Received on Thu Oct 9 18:35:52 2003