Re: [swish-e] How can I adjust the META names before an HTML document is indexed?

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Sun Sep 09 2007 - 00:36:19 GMT

On 9/7/07 4:30 PM, Ben Ostrowsky wrote:
> I'd like to glean metadata from the documents I'm indexing.  The
> documents have a predictable format:
> 
> ...
>  <BODY BGCOLOR="#ffffff">
>    <H1>[list-name] title of message</H1>
>     <B>name of message author</B>
>     <A HREF="..."
>        TITLE="...">username at email.host
>        </A><BR>
> ...
> 
> I'd like to be able to search these documents with "swish-e -w
> authorname=foo" or "swish-e -w authoremail=bar".
> 
> At what point during the process of indexing would it be possible to
> manipulate things so that I can do this?  Can I, for example, add a
> directive somewhere saying:
> 
> @metanames{qw( msgtitle authorname )}
>   =~ /<H1>[list-name] (.*)</H1>\w+<B>(.*)</B>/g;
> 
> or something like that?
> 

If you're using spider.pl or DirTree.pl with -S prog, then yes: you can
run a regex over each document's content before swish-e sees it and add
<meta> tags that carry the extracted values.

See the filter_content callback in spider.pl and (IIRC) there's 
something similar in DirTree.pl.
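
As a rough sketch of that approach (not tested against a live index), the
helper below pulls the three values out of the page layout quoted above and
writes them back as <meta> tags. The names add_meta_tags, msgtitle,
authorname and authoremail are made up for illustration. The callback
signature ( $uri, $server, $response, $content_ref ) is how I remember
spider.pl's filter_content, so check it against the spider.pl docs for
your version. The metanames would presumably also need to be listed with
MetaNames in the swish-e config before -w authorname=foo can find them.

```perl
use strict;
use warnings;

# Hypothetical helper: extract title, author name and author email from
# the archive page layout and return the HTML with <meta> tags added.
sub add_meta_tags {
    my ($html) = @_;

    my ( $title, $name, $email ) = $html =~ m{
        <H1>\s*\[[^\]]*\]\s*(.*?)\s*</H1>    # [list-name] title of message
        \s*<B>\s*(.*?)\s*</B>                # name of message author
        \s*<A\b[^>]*>\s*(.*?)\s*</A>         # username at email.host
    }xis;

    # Leave pages that do not follow the expected layout untouched.
    return $html unless defined $title;

    my $meta = '';
    for my $pair ( [ msgtitle => $title ], [ authorname => $name ], [ authoremail => $email ] ) {
        my ( $metaname, $value ) = @$pair;
        $value =~ s/"/&quot;/g;    # keep the attribute value well-formed
        $meta .= qq{<meta name="$metaname" content="$value">\n};
    }

    # Put the tags just before </head>; if there is no head, prepend them.
    $html =~ s{(</head>)}{$meta$1}i or $html = $meta . $html;
    return $html;
}

# Assumed spider.pl callback signature; the content arrives as a scalar ref.
sub filter_content {
    my ( $uri, $server, $response, $content_ref ) = @_;
    $$content_ref = add_meta_tags($$content_ref);
    return 1;    # a true value keeps the document
}
```

In a spider.pl config this would be hooked up with something like
filter_content => \&filter_content in the server's settings hash.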

See also SWISH::Prog on CPAN for building your own -S prog programs.
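
If you end up writing the -S prog program by hand instead, what swish-e
reads from it is a short header block, a blank line, and then the document
itself. Here is a minimal sketch of formatting one such record. It assumes
$html is a decoded character string, and prog_record is a made-up name.
Path-Name and Content-Length are the headers I'm sure are required;
Last-Mtime is optional.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: build one document record in the form
# swish-e expects on stdin from a -S prog program.
sub prog_record {
    my ( $path, $html, $mtime ) = @_;

    # Content-Length counts bytes, so encode before measuring.
    my $bytes = encode_utf8($html);

    return "Path-Name: $path\n"
         . 'Content-Length: ' . length($bytes) . "\n"
         . "Last-Mtime: $mtime\n"
         . "\n"
         . $bytes;
}
```

The program would print one record per document to stdout, and you would
run it with something like swish-e -S prog -i ./yourprog.pl.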

-- 
Peter Karman  .  peter(at)not-real.peknet.com  .  http://www.peknet.com/

_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Sat Sep 8 20:36:22 2007