Re: Merged words from XML tables

From: Peter Karman <karman(at)not-real.cray.com>
Date: Wed Oct 27 2004 - 14:55:53 GMT

Bill Moseley wrote on 10/27/04 9:37 AM:

> On Tue, Oct 26, 2004 at 11:42:09PM -0700, Stein-Egil Museus wrote:
> 
>><row><entry><para>559999</para></entry><entry><para>Some text</para></entry></row>
> 
> 
> Anyone know what the XML spec says about this?  How do you know which
> tags should split text?

short answer: you don't.

http://www.w3.org/TR/2000/REC-xml-20001006#sec-white-space

n.b.:
<snip>
An XML processor must always pass all characters in a document that are 
not markup through to the application. A validating XML processor must 
also inform the application which of these characters constitute white 
space appearing in element content.

A special attribute named xml:space may be attached to an element to 
signal an intention that in that element, white space should be 
preserved by applications. In valid documents, this attribute, like any 
other, must be declared if it is used. When declared, it must be given 
as an enumerated type whose values are one or both of "default" and 
"preserve".
</snip>
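
To make that concrete for the row quoted above, here is a minimal Python sketch. It uses Python's standard xml.etree.ElementTree module, not swish-e or libxml2, so treat it only as an illustration of what a parser hands to the application: the character data arrives with no whitespace between the two cells, and it is up to the application to decide whether a tag boundary separates words. The names ROW, merged and chunks are made up for this example.

```python
import xml.etree.ElementTree as ET

# The sample row from this thread, with the closing tag written as </row>
# so that it parses as well-formed XML.
ROW = ("<row><entry><para>559999</para></entry>"
       "<entry><para>Some text</para></entry></row>")

root = ET.fromstring(ROW)

# The parser passes through only the characters that are in the document.
# There is no whitespace between the two cells, so a naive join merges them.
merged = "".join(root.itertext())

# An application that treats every text node as its own chunk keeps them apart.
chunks = [t for t in root.itertext() if t.strip()]

print(merged)   # 559999Some text
print(chunks)   # ['559999', 'Some text']
```

Nothing in the XML itself says which of the two readings is right, which is the point of the spec text above.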




> 
> With HTML some tags are block level and some are inline:
> 
> Libxml2 provides a way to tell the difference.
> 

and I believe Bill put that fix, which increments word position on HTML 
block elements, into what will be the 2.4.3 release.

but for XML, I think you have to either:

1. always bump position on a new tag, or
2. explore the 'xml:space' attribute a little more. Maybe swish could use 
it to indicate whether word position should be bumped or not? Something 
like a new config option:

XMLBumpPositionAttr 0|1

and if set to 1, bump position.

?
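
A rough Python sketch of the two options, purely to show the bookkeeping. This is not swish-e code: word_positions and bump_on_every_tag are hypothetical names, and reading option 2 as "bump only on elements that carry an xml:space attribute" is one assumed interpretation of the XMLBumpPositionAttr idea, which does not exist as a directive today.

```python
import xml.etree.ElementTree as ET

# ElementTree reports the reserved xml: prefix under this namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def word_positions(xml_text, bump_on_every_tag=True):
    """Return (word, position) pairs for the text in xml_text.

    Hypothetical model only. With bump_on_every_tag=True every start tag
    adds one to the position (option 1). With False, only elements that
    carry an xml:space attribute do (an assumed reading of option 2).
    """
    out = []
    pos = 0

    def add_words(text):
        nonlocal pos
        for word in (text or "").split():
            pos += 1
            out.append((word, pos))

    def walk(elem):
        nonlocal pos
        if bump_on_every_tag or elem.get(XML_SPACE) is not None:
            pos += 1  # leave a gap so phrases cannot span this tag
        add_words(elem.text)
        for child in elem:
            walk(child)
            add_words(child.tail)

    walk(ET.fromstring(xml_text))
    return out


row = ("<row><entry><para>559999</para></entry>"
       "<entry><para>Some text</para></entry></row>")
marked = row.replace("</entry><entry>", '</entry><entry xml:space="default">')

print(word_positions(row))                           # option 1
print(word_positions(row, bump_on_every_tag=False))  # no bump anywhere
print(word_positions(marked, bump_on_every_tag=False))  # option 2
```

With no bump, "559999" and "Some" end up at adjacent positions, so a phrase search could match across the two cells. Either option leaves a gap between them; option 2 just moves the decision into the document.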


-- 
Peter Karman  .  http://www.cray.com/craydoc/ .  karman(at)not-real.cray.com
Received on Wed Oct 27 07:55:54 2004