Re: nested properties introduce spaces

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Mar 06 2002 - 20:20:22 GMT
At 02:35 PM 03/06/02 -0500, David L Norris wrote:
>> <name>
>>    <first>bill</first><last>moseley</last>
>> </name>
>> PropertyNames name
>
>I think I would consider the above markup as a continuous string
>"billmoseley" since first and last properties have no white space
>between.  The markup itself should be invisible.

Frankly, there's probably too much guesswork going on inside docprop.c, so
it could use a review.  The real issue is that white space is stripped
from properties, and I think that's a reasonable thing to do.  Take:

   <first>
        bill
   </first>

That ends up as property "bill", which I'd argue is the correct way to
go (consider sorting, or using -L for limiting).  You wouldn't want white
space to change the sort order of the property, so leading and trailing
white space is trimmed.
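
To illustrate the sorting point, here is a small sketch in Python (not
swish-e code; the values are made up).  Untrimmed values sort by their
leading white space rather than by their text:

```python
# Hypothetical property values as they might arrive from the parser.
raw_values = ["  bill", "adam", "\n   carl\n"]

# Newline and space both sort before letters, so the raw order is
# decided by the padding, not by the names.
sorted_raw = sorted(raw_values)
# ['\n   carl\n', '  bill', 'adam']

# Trimming first gives the order a user would expect.
sorted_trimmed = sorted(value.strip() for value in raw_values)
# ['adam', 'bill', 'carl']
```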

So "bill" gets set as property "first", and also as the nested property "name".

Now, later here comes:

   <last>
     moseley
   </last>

and again, it's a property with the value of "moseley".  Seems reasonable.
But now I need to combine it with the previous value for property "name".
What's the rule?

Should one document that has:

     <first>bill</first>

store the property differently than this?

   <first>
        bill
   </first>

I don't think so.

I don't gather up the properties individually -- it's a stream (SAX)
parser.  When I see a closing tag (such as </last>) I flush any text up to
that point (which might be "\n    moseley   \n") off to the indexing code,
and also off to the docproperty code.  Then that code says, "ok, I need to
add this text to this list of properties".  One might be a new property,
such as "last", or an existing property "name".
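
The flow described above can be sketched roughly as follows.  This is a
Python illustration using the standard xml.sax module, not the actual
docprop.c logic.  The class and function names are made up, and two things
are assumptions on my part: text is also flushed when a new tag opens, and
a single space is used when appending to an existing property.

```python
import xml.sax


class PropertyCollector(xml.sax.ContentHandler):
    """Hypothetical sketch of flush-on-tag property collection."""

    def __init__(self, property_names, ignore_tags=()):
        super().__init__()
        self.property_names = set(property_names)
        self.ignore_tags = set(ignore_tags)
        self.open_tags = []    # stack of currently open elements
        self.buffer = []       # text seen since the last flush
        self.properties = {}   # property name -> stored value

    def startElement(self, name, attrs):
        self._flush()          # assumption: flush on opening tags too
        self.open_tags.append(name)

    def endElement(self, name):
        self._flush()          # flush text up to the closing tag
        self.open_tags.pop()

    def characters(self, content):
        self.buffer.append(content)

    def _flush(self):
        # Trim leading and trailing white space before storing.
        text = "".join(self.buffer).strip()
        self.buffer = []
        if not text:
            return
        if any(tag in self.ignore_tags for tag in self.open_tags):
            return
        # Add the text to every open element that is a property:
        # a new one such as "last", or an existing one such as "name".
        for tag in self.open_tags:
            if tag in self.property_names:
                old = self.properties.get(tag)
                self.properties[tag] = text if old is None else old + " " + text


def collect_properties(xml_text, property_names, ignore_tags=()):
    handler = PropertyCollector(property_names, ignore_tags)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.properties
```

Because each flushed chunk is trimmed before it is appended, this sketch
stores the same value for <first>bill</first> as for the version spread
over three lines.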

Anyway, here's how it currently works.

> cat 1.xml                               
<name>
   <first>
      bill
   </first>
   <ignore>
     ignoreword
   </ignore>
   <last>
      moseley
   </last>
</name>

> cat c
propertyNames name first last
ignoreMetaTags ignore
DefaultContents XML2


> ./swish-e -c c -i 1.xml -T properties -v0
Indexing Data Source: "File-System"
          swishdocpath: 6 (  5) S: "1.xml"
          swishdocsize: 8 (  4) N: "0000000000126"
     swishlastmodified: 9 (  4) D: "2002-03-06 11:46:06"
                  name:10 ( 12) S: "bill moseley"
                 first:11 (  4) S: "bill"
                  last:12 (  7) S: "moseley"
Indexing done!

So, in that case, it does what I'd hope.  At least for the name example.

The "solution" would be YACS (yet another configuration setting):

  AddSpaceWhenConcatProperties name
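
Note that AddSpaceWhenConcatProperties is only a proposed name here, not an
existing setting.  As a rough Python sketch of what it might control (the
function and its add_space parameter are hypothetical):

```python
def append_property(existing, new_text, add_space=True):
    """Append new_text to an existing property value.

    add_space models the proposed per-property setting: when it is
    False the pieces are joined directly, e.g. "billmoseley".
    """
    new_text = new_text.strip()
    if not existing:
        return new_text
    if not new_text:
        return existing
    separator = " " if add_space else ""
    return existing + separator + new_text
```

With add_space=True, "bill" followed by "\n    moseley   \n" gives
"bill moseley"; with add_space=False it gives "billmoseley".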

The beauty of having the source code is that this behavior can be changed
in a very short time, if needed.



-- 
Bill Moseley
mailto:moseley@hank.org
Received on Wed Mar 6 20:20:54 2002