William M Conlon wrote on 4/8/08 9:36 PM:
> On Apr 8, 2008, at 7:11 PM, Peter Karman wrote:
>>> OTOH, it seems that there are repeated inquiries on the list about
>>> how to insert meta data about the document into the index. Often we
>>> know things about the document that are not included in the document
>>> itself, and it seems that an extension of the existing filtering
>>> mechanism might be useful.
>> see URL above. That version of SWISH::Filter needs to get merged back into
>> the Swish-e dist. It definitely will in 2.6; not sure if it will in 2.4.x.
> hmm. I've just about finished hacking spider.pl to add another user-
> defined callback function to allow me to insert the additional
> attributes into ALL documents, including the TEXT/HTML types that are
> normally not filtered.
yes, doing it at the aggregator level is much better than using SWISH::Filter.
I just referenced that feature in SWISH::Filter to show that "the existing
filtering mechanism" already had what you were asking about.

But you want non-filtered docs (html, txt, xml) to get the metadata too. So
hacking spider.pl is better.
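
To make the aggregator-level idea concrete, here is a rough sketch of the general shape. It is in Python rather than Perl only because swish-e's -S prog input accepts output from any program; it is not how spider.pl is actually structured. The function names (add_meta_tags, format_for_swish), the "department" attribute, and the example URL are all made up for illustration, and the sketch assumes HTML input. Each extra attribute is injected as a <meta> tag, then the document is emitted with the usual -S prog headers (Path-Name, Content-Length, Last-Mtime, Document-Type).

```python
import html
import re
import sys


def add_meta_tags(html_text, extra):
    """Return html_text with one <meta> tag per item in `extra` (hypothetical helper).

    Tags go right after the opening <head>; if there is no <head>,
    a minimal one is prepended. HTML input is assumed.
    """
    tags = "".join(
        '<meta name="{}" content="{}">\n'.format(
            html.escape(name, quote=True), html.escape(value, quote=True)
        )
        for name, value in extra.items()
    )
    match = re.search(r"<head(\s[^>]*)?>", html_text, re.IGNORECASE)
    if match is None:
        return "<head>\n" + tags + "</head>\n" + html_text
    cut = match.end()
    return html_text[:cut] + "\n" + tags + html_text[cut:]


def format_for_swish(path, content, mtime=None, doc_type="HTML*"):
    """Build one document record in the -S prog format (hypothetical helper).

    Content-Length must be the byte length of the body, so the content
    is encoded before it is measured.
    """
    body = content.encode("utf-8")
    headers = ["Path-Name: " + path, "Content-Length: %d" % len(body)]
    if mtime is not None:
        headers.append("Last-Mtime: %d" % mtime)
    headers.append("Document-Type: " + doc_type)
    # A blank line separates the headers from the document body.
    return ("\n".join(headers) + "\n\n").encode("utf-8") + body


if __name__ == "__main__":
    # Example values only; a real aggregator would fetch the page first.
    page = "<html><head><title>Report</title></head><body>text</body></html>"
    page = add_meta_tags(page, {"department": "sales"})
    record = format_for_swish("http://example.com/report.html", page, mtime=1207700000)
    sys.stdout.buffer.write(record)
```

The point is only that the metadata is added after fetching and before the Content-Length is computed, so every document type passes through the same step. Any meta name added this way would still need to be listed under MetaNames in the swish-e config to be searchable.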
fwiw, SWISH::Prog makes this easy. http://svn.swish-e.org/perl/SWISH-Prog/trunk/
That should be making its way to CPAN in the next few days I hope.
> But it looks like the meta_data() method would allow me to instead
> build a filter that inserts the attributes as meta data. I take it I
> need to update the filters (such as pdf2html) to use set_continue, so
> that after type conversion, my attribute_insertion filter gets called?
You could use SWISH::Filter and write an AddMetadata filter I guess. Yes, you'd
need to set set_continue() to true to get the chaining effect for existing
filters. If it were me, I'd be doing it in the aggregator (spider.pl e.g.)
instead though, since then you could add the metadata just before you print() to
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Tue Apr 8 22:57:06 2008