Re: can swish-e index mail folders?

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sat Jul 09 2005 - 06:22:26 GMT
On Fri, Jul 08, 2005 at 08:19:05PM -0700, don fong wrote:
> Bill Moseley wrote:
> >Swish cannot do this natively.
> >
> >I'll bet someone has done this, but you can likely write
> >something without much work:
> >
> >    http://search.cpan.org/modlist/Mail_and_Usenet_News/Mail
> >
> >I'd look at Mail::Box for extracting out the messages and then maybe
> >Mail::Message for looking at individual messages.
> 
> thanks for your answer.  could you be more explicit about how you're
> suggesting these modules be used with swish-e?

Write a program that reads the mailboxes and fetches the individual
messages (Mail::Box) and then parses each message (Mail::Message) into
header and body (keeping in mind a body can contain many parts,
including one or more complete email messages).  Decide what headers
you want to search on and what types of bodies you want to index and
build either an HTML or XML file with the data.  XML might seem more
natural, but using HTML allows you a bit more control over how things
get ranked (swish-e was originally for indexing HTML docs).  So, for
example, you might put the mail subject in a <title> tag to make it
rank higher.
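
As a rough illustration of the parsing step, here is a minimal sketch
that turns one raw message into a small HTML document with the subject
in the <title> tag.  It uses Python's standard email package rather
than the Perl Mail::Box and Mail::Message modules suggested above, and
the function name, the "from" meta name, and the choice to index only
text/plain parts are assumptions made for the example.

```python
import email
import html
from email import policy


def message_to_html(raw_bytes):
    """Build a small HTML document from one raw RFC 822 message (sketch)."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    subject = str(msg["subject"] or "(no subject)")
    sender = str(msg["from"] or "")

    # walk() visits every part, including parts of nested messages.
    # Assumption for this sketch: only inline text/plain parts are indexed.
    chunks = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain" and not part.is_attachment():
            chunks.append(part.get_content())
    body = "\n".join(chunks)

    # Escape everything so mail content cannot break the markup.
    return (
        "<html><head><title>{}</title>"
        '<meta name="from" content="{}"></head>'
        "<body><pre>{}</pre></body></html>"
    ).format(html.escape(subject), html.escape(sender), html.escape(body))
```

The raw bytes for each message could come from, for example,
mailbox.mbox(path).get_bytes(key) in the Python standard library.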

Write the XML or HTML containing the data you want indexed from the
email messages to a file along with added headers to allow swish-e to
parse the file.  These headers tell swish the file name (or URL) to
return in search results and the length of the following message in
bytes.  See the DirTree.pl example in the distribution for how
to format the message correctly for swish.
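
The framing described above might look roughly like the sketch below.
The header names (Path-Name, Content-Length, Document-Type) are given
as I recall them from the swish-e -S prog documentation, so check them
against the SWISH-RUN docs and DirTree.pl before relying on this.  The
function name and the UTF-8 encoding are assumptions for the example.

```python
def swish_record(path_name, document, doc_type="HTML*"):
    """Wrap one document in the headers read by swish-e -S prog (sketch)."""
    # Content-Length must count bytes, not characters, so encode first.
    body = document.encode("utf-8")
    header = (
        "Path-Name: {}\n"
        "Content-Length: {}\n"
        "Document-Type: {}\n"
        "\n"  # a blank line separates the headers from the document
    ).format(path_name, len(body), doc_type)
    return header.encode("utf-8") + body
```

Writing each record in binary mode, one after another, to a single file
gives the kind of input that the indexing step expects on stdin.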

Once that file is created it can be read by swish:

    swish-e -c swish.config -S prog -i stdin < mail.xml

http://swish-e.org/docs/swish-run.html has a short description about
using the -S prog input method with swish.

You can also pipe the output of the program directly to swish, but I
kind of like separating the process into two steps -- parsing and
formatting, and then indexing.

There are other mail parsing modules that might be easier to work
with, but mail messages can be complex and I think the Mail:: modules
are good.

If you want to index attachments such as PDF or MS-Word you can use
SWISH::Filter which comes with swish-e.  Again, see DirTree.pl for an
example of how you might use that module.  IIRC, some mail clients
don't set content type correctly (or just use
application/octet-stream), so be aware of that.
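
One possible way to cope with the generic content type is to fall back
on the attachment's file name.  The sketch below uses Python's standard
mimetypes module, not SWISH::Filter, and the function name is
hypothetical.  It only handles a missing or generic type; a type that
is declared but wrong would still need to be detected some other way.

```python
import mimetypes

# Declared types that carry no real information about the attachment.
GENERIC_TYPES = {"application/octet-stream", ""}


def effective_content_type(declared_type, filename):
    """Pick a content type, guessing from the file name when needed."""
    declared = (declared_type or "").strip().lower()
    if declared not in GENERIC_TYPES:
        return declared
    # Fall back to the file extension, e.g. "report.pdf".
    guessed, _encoding = mimetypes.guess_type(filename or "")
    return guessed or "application/octet-stream"
```

The result can then be used to decide which filter, if any, to run on
the attachment before it is indexed.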


-- 
Bill Moseley
moseley@hank.org

Received on Fri Jul 8 23:22:28 2005