[swish-e] How can I adjust the META names before an HTML document is indexed?

From: Ben Ostrowsky <ben(at)not-real.benostrowsky.com>
Date: Fri Sep 07 2007 - 21:30:59 GMT
I'd like to glean metadata from the documents I'm indexing.  The
documents have a predictable format:

...
 <BODY BGCOLOR="#ffffff">
   <H1>[list-name] title of message</H1>
    <B>name of message author</B>
    <A HREF="..."
       TITLE="...">username at email.host
       </A><BR>
...

I'd like to be able to search these documents with "swish-e -w
authorname=foo" or "swish-e -w authoremail=bar".

At what point during the process of indexing would it be possible to
manipulate things so that I can do this?  Can I, for example, add a
directive somewhere saying:

@metanames{qw( msgtitle authorname )}
  = m{<H1>\[list-name\] (.*?)</H1>\s*<B>(.*?)</B>}s;

or something like that?
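
To make the idea concrete, here is a rough Python sketch of the extraction described above, run as a preprocessing step on each document before it reaches the indexer. It rests on several assumptions: that the pages really follow the layout shown, that the indexer picks up <META NAME=... CONTENT=...> tags for names it has been configured to treat as metanames, and that there is some hook for feeding it rewritten HTML. The function names (extract_meta, add_meta_tags) and the pattern are illustrative only and are not part of swish-e.

```python
import re
from html import escape

# Matches the layout shown above: <H1>[list-name] title</H1>, then
# <B>author name</B>, then an <A ...> element whose text is the address.
HEADER_RE = re.compile(
    r"<H1>\s*\[[^\]]*\]\s*(?P<msgtitle>.*?)\s*</H1>\s*"
    r"<B>(?P<authorname>.*?)</B>\s*"
    r"<A\b[^>]*>\s*(?P<authoremail>.*?)\s*</A>",
    re.IGNORECASE | re.DOTALL,
)


def extract_meta(html):
    """Return a dict of the captured fields, or {} if the layout is absent."""
    match = HEADER_RE.search(html)
    return match.groupdict() if match else {}


def add_meta_tags(html):
    """Insert one META tag per captured field before </HEAD> (or <BODY)."""
    meta = extract_meta(html)
    if not meta:
        return html  # leave documents that do not fit the layout untouched
    tags = "".join(
        '<META NAME="%s" CONTENT="%s">\n' % (name, escape(value, quote=True))
        for name, value in meta.items()
    )
    # The first of </HEAD> or <BODY in the text is the insertion point.
    return re.sub(
        r"</HEAD>|<BODY\b",
        lambda mo: tags + mo.group(0),
        html,
        count=1,
        flags=re.IGNORECASE,
    )
```

Documents that do not match the expected layout pass through unchanged, so a stray page should not break the run. Whether the resulting tags become searchable with -w authorname=foo depends on how the indexer is configured, which this sketch does not address.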

Ben

-- 
"Don't get suckered in by the comments;
 they can be terribly misleading.
 Debug only code."  -- Dave Storer
Received on Fri Sep 7 17:30:59 2007