Re: [swish-e] How can I adjust the META names before an HTML document is indexed?

From: Peter Karman <peter(at)>
Date: Sun Sep 09 2007 - 00:36:19 GMT
On 9/7/07 4:30 PM, Ben Ostrowsky wrote:
> I'd like to glean metadata from the documents I'm indexing.  The
> documents have a predictable format:
> ...
>  <BODY BGCOLOR="#ffffff">
>    <H1>[list-name] title of message</H1>
>     <B>name of message author</B>
>     <A HREF="..."
>        TITLE="...">username at
>        </A><BR>
> ...
> I'd like to be able to search these documents with "swish-e -w
> authorname=foo" or "swish-e -w authoremail=bar".
> At what point during the process of indexing would it be possible to
> manipulate things so that I can do this?  Can I, for example, add a
> directive somewhere saying:
> @metanames{qw( msgtitle authorname )}
>   =~ /<H1>[list-name] (.*)</H1>\w+<B>(.*)</B>/g;
> or something like that?

If you're using the or with -S prog, then yes, you 
can filter the content with a regex and output additional <meta> tags 
with the content.
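
A rough sketch of that idea, written in Python rather than Perl since a -S prog program can be any executable. The regexes are guesses based only on the layout quoted above, and add_meta_tags and prog_record are names made up for this example, not part of swish-e. The record format (Path-Name and Content-Length headers, a blank line, then the document) is what I understand -S prog to expect, so check it against the swish-e docs before relying on it.

```python
import re

# Hypothetical patterns for the message layout quoted above; adjust to the real pages.
TITLE_RE = re.compile(r"<H1>\[[^\]]*\]\s*(.*?)</H1>", re.I | re.S)
AUTHOR_RE = re.compile(r"<B>(.*?)</B>", re.I | re.S)
EMAIL_RE = re.compile(r"<A\s+HREF=[^>]*>\s*(.*?)\s*</A>", re.I | re.S)

PATTERNS = (
    ("msgtitle", TITLE_RE),
    ("authorname", AUTHOR_RE),
    ("authoremail", EMAIL_RE),
)


def add_meta_tags(doc):
    """Return doc with a <meta> tag added for each field found (first match wins)."""
    metas = ""
    for name, pattern in PATTERNS:
        match = pattern.search(doc)
        if not match:
            continue
        # Collapse runs of whitespace and protect the attribute quoting.
        value = " ".join(match.group(1).split()).replace('"', "&quot;")
        metas += '<meta name="%s" content="%s">\n' % (name, value)
    if not metas:
        return doc
    # Prefer an existing <head>; otherwise create one just before <body>.
    head = re.search(r"<head\b[^>]*>", doc, re.I)
    if head:
        return doc[:head.end()] + "\n" + metas + doc[head.end():]
    body = re.search(r"<body\b[^>]*>", doc, re.I)
    if body:
        return doc[:body.start()] + "<head>\n" + metas + "</head>\n" + doc[body.start():]
    return metas + doc


def prog_record(path, doc):
    """Build one record: headers, a blank line, then the document bytes."""
    body = doc.encode("utf-8")
    header = "Path-Name: %s\nContent-Length: %d\n\n" % (path, len(body))
    return header.encode("utf-8") + body
```

A program built on this would loop over the files, call add_meta_tags on each one, and write prog_record(path, filtered) to sys.stdout.buffer. The new names also need to be declared in the config (MetaNames msgtitle authorname authoremail) before "swish-e -w authorname=foo" will find anything.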

See the filter_content callback in and (IIRC) there's 
something similar in

See also SWISH::Prog on CPAN for building your own -S prog programs.

Peter Karman  .  peter(at)  .

Received on Sat Sep 8 20:36:22 2007