Re: using DC.Date.modified for swishlastmodified property

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Jul 21 2005 - 14:39:24 GMT
On Wed, Jul 20, 2005 at 07:34:26PM -0700, Peter Farmer wrote:
> Can anyone confirm that if I want to use an alternative HTML metadata 
> element (in this case the Dublin Core Date.modified element) as the 
> swishlastmodified property for an indexed document (via 
> PropertyNameAlias) that the only encoding scheme that will work is 
> 'seconds since the UNIX epoch' ?

That's basically true, but the value is just stored as an integer in
the index and is only converted to a date string in some cases.  So, I
suppose, it doesn't technically have to be a UNIX epoch.

But you don't really need to alias -- just create a new property.
There's nothing special about the swishlastmodified property that
cannot be applied to other properties.

> At present all the docs to be indexed contain DC date elements encoded 
> via the (default) W3C-DTF scheme. Also I don't think that the DC allows 
> any other format for Date elements. I certainly have never seen anyone 
> generating DC Date elements with Unix epoch time-stamps.

The only advantage of using numeric (or date) properties is that they
take up less space and are likely faster to sort.  If the date is
YYYY-MM-DD then it should work for sorting as a string.  That should
hold even with the time included, although I think you would need to
be sure you have a standardized timezone.
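
To illustrate the timezone caveat, here is a minimal Python sketch (not
swish-e code). It assumes the date strings look like the one in the
warning quoted below ('YYYY-MM-DD HH:MM:SS +hhmm'); the helper name
to_sortable_utc is made up for this example.

```python
from datetime import datetime, timezone

def to_sortable_utc(stamp: str) -> str:
    """Parse 'YYYY-MM-DD HH:MM:SS +hhmm' and return a fixed-width UTC string."""
    parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S %z")
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

raw = [
    "2005-06-30 09:39:12 +1000",  # 2005-06-29 23:39:12 in UTC
    "2005-06-30 01:00:00 +0000",  # 2005-06-30 01:00:00 in UTC
]

# Plain string sort ignores the offset and puts the +0000 entry first,
# even though it is the later moment in time.
naive_order = sorted(raw)

# Sorting on the normalized UTC string matches chronological order.
utc_order = sorted(raw, key=to_sortable_utc)
```

With mixed offsets the two orderings differ, which is why the strings
would need to be normalized to one timezone before being stored as a
string property.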

> Indexing said documents generate this error (and empty 
> swishlastmodified properties) :
> 
> Warning: EncodeProperty - Invalid char '-' found in string '2005-06-30 
> 09:39:12 +1000'
> Warning: Failed to add property 'swishlastmodified' in file 
> 'http://myserver.mydomain/mydocument'

Right, it's having a hard time reading that string as an integer.

> Is there a recommended way to extend swish-e cleanly to do the 
> conversion or do I have to modify core swish-e code to enable detection 
> of W3C-DTF date metadata and convert it to unix epoch format?

You mean like some kind of plug-in?  That would be nice, but there's
nothing like that now.  I suspect it would not be too hard to hack
parser.c to decode the date into an epoch, as long as you are within
the date range of the UNIX timestamp.  Obviously, it would be more
work to incorporate it into swish as a new data type.  But perhaps
not too much.
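
The conversion itself is small.  As a rough sketch in Python rather
than the C that parser.c would need, and assuming the string format
from the warning above: the function name date_to_epoch and the signed
32-bit limit are assumptions made for illustration, not something taken
from the swish-e source.

```python
from datetime import datetime

# Assumption: the value must fit a non-negative signed 32-bit timestamp.
MAX_EPOCH = 2**31 - 1

def date_to_epoch(stamp: str) -> int:
    """Convert 'YYYY-MM-DD HH:MM:SS +hhmm' to seconds since the UNIX epoch."""
    parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S %z")
    epoch = int(parsed.timestamp())
    if not 0 <= epoch <= MAX_EPOCH:
        raise ValueError(f"date outside assumed timestamp range: {stamp!r}")
    return epoch

print(date_to_epoch("2005-06-30 09:39:12 +1000"))  # 1120088352
```

Strict W3C-DTF values use a 'T' separator and a colon in the offset
(2005-06-30T09:39:12+10:00), so a real implementation would need to
accept that form as well.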

> Or would it be better idea to extend the spider to preconvert the 
> DC.Date.modified values before passing to swish-e ?

That's probably not too hard -- use something like HTML::Parser to
traverse the tree and replace the string with an epoch.  It would be
slower, of course, and it makes more sense to do it in swish where
the document is already being parsed by libxml2.
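
A sketch of that pre-conversion step, written in Python instead of Perl
and using a regular expression instead of a real HTML parser, so it is
only a starting point.  It assumes the meta tag is written with the
name attribute before content and double-quoted values; the function
name preconvert is hypothetical.

```python
import re
from datetime import datetime

# Assumption: name="..." comes before content="..." and both use double quotes.
META_RE = re.compile(
    r'(<meta\s+name="DC\.Date\.modified"\s+content=")([^"]*)(")',
    re.IGNORECASE,
)

def preconvert(html: str) -> str:
    """Replace the DC.Date.modified value with seconds since the UNIX epoch."""
    def swap(match):
        try:
            parsed = datetime.strptime(
                match.group(2).strip(), "%Y-%m-%d %H:%M:%S %z"
            )
        except ValueError:
            return match.group(0)  # leave dates we cannot parse untouched
        return f"{match.group(1)}{int(parsed.timestamp())}{match.group(3)}"
    return META_RE.sub(swap, html)

page = '<meta name="DC.Date.modified" content="2005-06-30 09:39:12 +1000">'
converted = preconvert(page)
```

Dates that do not match the expected format are passed through
unchanged, so swish-e would still warn about them rather than index a
wrong value.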

> If I do need to modify swish-e, is this facility something that 
> could be folded back into the main code base, rather than 
> me having to maintain a forked version?

Seems like a great feature to allow for parsing of those dates
internally.

-- 
Bill Moseley
moseley@hank.org

Received on Thu Jul 21 07:39:26 2005