RE: Issue-0079: String-value vs. string-value of the typed-value from Kay, Michael on 2002-12-10 (public-qt-comments@w3.org from December 2002)

From: Kay, Michael <Michael.Kay@softwareag.com>
Date: Wed, 11 Dec 2002 00:46:21 +0100
To: David Holmes <dholmes@tibco.com>, "'public-qt-comments@w3.org'" <public-qt-comments@w3.org>
Message-ID: <DFF2AC9E3583D511A21F0008C7E621060453DE97@daemsg02.software-ag.de>
> -----Original Message-----
> From: David Holmes [mailto:dholmes@tibco.com] 
> Sent: 10 December 2002 20:21
> To: 'public-qt-comments@w3.org'
> Subject: Re: Issue-0079: String-value vs. string-value of the 
> typed-value
> 
<snip/>
> 
> 0079: String-value vs. string-value of the typed value 
> =====================================
> 
> Before addressing the issues above, and Jeni's message, I 
> would like to identify three value spaces then walk through 
> some simple examples to see how they are related.  The value 
> spaces (meanings in parenthesis) are:
> 
> i) Lexical-value (serialized form)
> ii) String-value (xs:string that results from some 
> XPath/XQuery function).
> iii) Typed-value (atomic-value*)
> 
> The first case I will consider is an XML source without type 
> information (i.e. parsed without a schema).  In this case, 
> attribute values will have the type annotation 
> xs:anySimpleType and are essentially opaque.  The string 
> value is the same as the lexical value.

OK. <snip/>
> 
> Now consider the case of an XML source with type information 
> and what happens to an xs:decimal("01") and an 
> xs;QName("foo:bar") (forgive the constructor hand-waving and 
> assume that both types are fully parsed and expanded 
> respectively).  The lexical values are clearly not open to 
> dispute being "01" and "foo:bar".  The typed values are 1 and 
> the uri/local-name pair (uri corresponding to foo, baz).  The 
> string value (as defined by XPath
> 1.0) is "1" for the xs:decimal and undefined for the 
> xs:QName.  

Did you mean "as defined by XPath 2.0"? In 1.0, clearly, the string values
are "01" and "foo:bar". In 2.0, as currently defined, they are the same, but
this is the subject of issue 079.

This is obviously a rather controversial issue, there has been a lot of
discontent expressed on the xsl-list with the suggestion that information
(such as trailing zeros) might be lost as a result of data typing. Some
people see data typing and schemas as purely a validation mechanism, and
resent any idea that a schema should cause the value as written to be
canonicalized in some way. I'm not saying these people are right or wrong,
just that there's definitely an open issue here and whatever we decide is
going to leave some people dissatisfied.

> The point I would like to make here is that 
> string-value is only tenuously connected to the lexical value

That's one point of view: it tends to be the "data" point of view, whereas
the "document" point of view leans the other way.

<snip/>
 
> First, using the example of an 
> xs:double (and to address one of Jeni's
> points) what is the string value of the following content?:
> 
> <value xsi:type="xs:double">32</value>
> 
> Unfortunately for Jeni, XPath 2.0 defines it to be "3.2E1".

No, the current definition in the XPath data model is that the string value
of an element node is the concatenation of the contents of the text nodes.
 
<snip/>
> 
> Addressing another of Jeni's examples - the case of mixed content:
> 
> <foo xsi:type="fooType">
>    blah <bar>blah</bar> blah
> </foo>
> 
> The string value of the foo element *can* be obtained from 
> it's typed value
> - it's not an error. Here I'll assert that the typed value of 
> <foo> is the sequence obtained by concatonating the typed 
> value of all the child nodes (atomization of the child node 
> sequence).

Again, that's one possible definition of the typed value, but it's not the
one in the current data model. You hit a problem here if the tree is more
than two levels deep, because sequences can't contain sequences. Do you want
the tree to be flattened, losing the hierarchic structure? This seems
dubious.

<snip/>
> 
> I would also suggest that text nodes be allowed to return a 
> type other than xs:anySimpleType.  The normalization rules 
> for text nodes make it possible to think of text nodes as 
> typed values which are bound to elements or documents.

Text nodes appear either in simple content or in mixed content. In simple
content you never need concern yourself with text nodes, because you can use
the typed value of the containing element: unless you are actually
interested in embedded comments and processing instructions. In mixed
content, text nodes clearly can't have any sensible type other than string.
So I'm not sure what you're trying to achieve. 
 
> However, text nodes don't have type annotations.  I strongly 
> believe that allowing text nodes to return typed-values other 
> than simple types will be key to making the Data Model spec 
> feasible to implement. Without it, a data model would have to 
> perform endless calculations and caching as the user 
> alternates between the typed view and the text-node view.  I 
> would also almost go so far as to say that the 
> dm:string-value accessor on the Data Model is redundant 
> because it can be calculated from the typed value, but for 
> performance reasons with untyped data it probably should be retained.

In XSLT, we get enough complaints from people that we throw away entity
references and CDATA sections when they try to do an identity
transformation. If we convert 1234.00 to 1234.0 just because the schema says
the type is xs:decimal, we are certainly going to have some unhappy users,
and quite possibly some failing applications. In XQuery too, the feedback
from our (Tamino) users is that they want to retain a high degree of
fidelity in the documents they store: if they wanted to store conventional
structured data, they wouldn't be using an XML database. So my own
preference is that we should keep the lexical value and treat the typed
value as being derived from it.
> 
> <xsl:value-of> and <xsl:text> don't create strings! 
> ====================================
> 
> Having these instructions generate strings as opposed to 
> typed values makes it impossible to correctly emit an 
> xs:QName.

This is currently true, though I think we can probably find a solution to
this requirement.

> The correct approach here is for these 
> instructions to first create a sequence through their 
> appropriate sequence constructors (select attribute or 
> "content-constructor") which is then atomized to create a 
> typed value.

This debate was one the working group (or rather, the two working groups
together) had in great detail. It's difficult to summarize the debate and
the reasons for choosing what was dubbed "Alternative 1" in a few lines.
Alternative 1 was that tree construction should always construct lexical
values, and type annotations should only be restored by applying schema
validation (this is the current situation in XQuery, though XSLT has relaxed
this to allow simple type annotations to be specified at node construction
time). Alternative 3 allowed an arbitrary sequence to be written to a node,
including heterogeneous sequences such as a sequence of arbitrary strings,
which in general cannot be serialized and reparsed losslessly. Alternative 2
had various flavours which attempted to identify cases where the type of the
sequence could be retained without creating instances of the data model that
could never arise from parsing real XML. None of the flavours of Alternative
2 worked, so we ended up (with considerable reluctance) adopting alternative
1.


> That typed value then creates a text node or a 
> series of atomic values depending upon the output context 
> (for XSLT this context is usually a Data Model node, but the 
> creation of sequences is possible).  These instructions are 
> also converging and probably should share the same syntax 
> (excepting the actual tag name. Both should have optional 
> select expressions or sequence(content) constructors like 
> xsl:variable.  

I don't see that much would be gained by merging xsl:value-of and xsl:text
at this stage of the game, though a single instruction that served the
purposes of both might have been a good idea originally.


> Implementations must deal with separators 
> carefully but the approach is conceptually simple - rendering 
> should be delayed as long as possible because serialization 
> of an atomic value is, in general, context dependent.

What makes it non-simple is that many sequence values that we allow in our
data model could never be obtained by parsing an actual XML document and
validating it with a schema. We made the decision, which I think is the
right one, that it should not be possible to construct such trees
programmatically. This ensures that the data model is round-trippable to
serialized XML.

> In 
> general I believe that conceptual simplifications occur in 
> the XSLT specification if the Data Model concept of sequences 
> and typed-values are embraced as being the first-class 
> citizens and intermediate string values and fragments are 
> shunned as being a bottleneck for type-enabling XSLT.  I also 
> believe that this can be done in a way that minimises 
> backwards compatibility issues (even though these are a 
> *should* in the RFC) and above all in a way that makes sense 
> of the string-value concept.
> 

I think it's entirely possible that the language could be described
consistently in the way you advocate. It certainly isn't obvious that this
would make it simpler, and it would certainly take it further away from the
expectations of many document-oriented XML and XSLT users, who are currently
generating a lot of noise trying to push us in the opposite direction (that
is, returning to the roots of XML as a language for annotating text using
markup).

Thanks for your comments. There are no easy answers!

Michael Kay
Software AG
Received on Tuesday, 10 December 2002 18:46:37 UTC