On Apr 5, 2008, at 8:40 PM, Peter Karman wrote:
> William M Conlon wrote on 4/4/08 5:09 PM:
>> I have a list of documents to be indexed. In addition to the
>> document path, the list includes other attributes that should be
>> searchable, so they need to be included in the index, although they may
>> not be in the document itself.
>> My first thought was to use -S prog, with my external program reading
>> each document, generating HTML to feed swish-e, and inserting <meta
>> name="language" content="english"> for each attribute into the <head>
>> section of the HTML.
> That's what I would do.
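
A minimal sketch of such an external program, in Python rather than Perl. It assumes the `-S prog` input format is a block of header lines (`Path-Name`, `Content-Length`, `Document-Type`), a blank line, and then the document bytes; check the header names against the SWISH-RUN documentation before relying on them. The function names, the sample path, and the `language` attribute are hypothetical, and the index configuration would presumably also need a matching MetaNames entry.

```python
import html
import sys


def build_html(title, body_text, attributes):
    """Wrap plain text in minimal HTML, adding one <meta> tag per attribute."""
    metas = "\n".join(
        '<meta name="{}" content="{}">'.format(
            html.escape(name, quote=True), html.escape(value, quote=True)
        )
        for name, value in sorted(attributes.items())
    )
    return (
        "<html><head>\n<title>{}</title>\n{}\n</head>\n"
        "<body><pre>{}</pre></body></html>\n"
    ).format(html.escape(title), metas, html.escape(body_text))


def prog_record(path, html_doc):
    """Format one document as header lines, a blank line, then the content.

    Header names are an assumption about the prog input format.
    """
    payload = html_doc.encode("utf-8")
    headers = (
        "Path-Name: {}\n"
        "Content-Length: {}\n"  # length in bytes, not characters
        "Document-Type: HTML*\n"
        "\n"
    ).format(path, len(payload))
    return headers.encode("utf-8") + payload


if __name__ == "__main__":
    # Hypothetical document and attribute list, for illustration only.
    doc = build_html("Example", "Hello, world.", {"language": "english"})
    sys.stdout.buffer.write(prog_record("docs/example.txt", doc))
```

The length is computed from the encoded bytes so that non-ASCII content does not throw the count off.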
>> My second thought was that swish-e needs to accept attributes that
>> are fed to the indexer with the document, perhaps in a *NEW*
>> Attribute header, a la:
> Would require hacking the source. And not really a good change, imo.
> It means applying parsing and tokenization at the header-parsing stage,
> which just seems unnecessary, especially when the MetaName feature
> already supports HTML or XML tags in the document content.
I took a look at the source, and while it's straightforward to
capture the meta data in extprog.c, feeding these attributes into the
parser while it's evaluating the document requires the same work as
doing it in a Perl callback, where it's far easier.
OTOH, it seems that there are repeated inquiries on the list about
how to insert meta data about the document into the index. Often we
know things about the document that are not included in the document
itself, and it seems that an extension of the existing filtering
mechanism might be useful.
To me it would be ideal to be able to feed two streams into swish-e:
* one stream is the [filtered] content.
* the second stream consists of document attributes that are not
contained in the document itself.
For now, I can take these two streams and merge them before
indexing. But perhaps the distinction between information in the
document and information about the document could be worked into your
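
As a rough illustration of merging the two streams before indexing, the Python sketch below injects attributes into the `<head>` of an already-filtered HTML document. The function name `inject_meta` and the sample attribute are hypothetical, and the sketch assumes the filtered content is HTML that already contains a `<head>` element.

```python
import html
import re


def inject_meta(html_doc, attributes):
    """Insert one <meta> tag per attribute right after the opening <head> tag."""
    metas = "".join(
        '\n<meta name="{}" content="{}">'.format(
            html.escape(name, quote=True), html.escape(value, quote=True)
        )
        for name, value in sorted(attributes.items())
    )
    # Match <head> or <head ...>, but not <header>; replace the first hit only.
    merged, count = re.subn(
        r"<head(\s[^>]*)?>",
        lambda m: m.group(0) + metas,
        html_doc,
        count=1,
        flags=re.IGNORECASE,
    )
    if count == 0:
        raise ValueError("document has no <head> element to receive attributes")
    return merged


if __name__ == "__main__":
    # Stream one: filtered content. Stream two: attributes known about it.
    content = "<html><head><title>t</title></head><body>x</body></html>"
    print(inject_meta(content, {"language": "english"}))
```

Keeping the attributes in a separate mapping until this last step preserves the distinction between what is in the document and what is known about it.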
>> And my last thought was to overload the Path-Name with the attributes
>> and use ExtractPath to build metanames.
> That's doable too. But I would still use <meta> tags myself.
Users mailing list
Received on Tue Apr 8 01:43:10 2008