comments on bug 6009 (first installment) from C. M. Sperberg-McQueen on 2009-04-11 (www-xml-schema-comments@w3.org from April to June 2009)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Sat, 11 Apr 2009 15:46:56 -0600
To: John Arwe <johnarwe@us.ibm.com>, www-xml-schema-comments@w3.org
Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
Message-Id: <8C1D1296-0D91-4715-94FB-7A104193FF05@blackmesatech.com>
In bug 6009 (http://www.w3.org/Bugs/Public/show_bug.cgi?id=6009),
on 2 September 2008, John Arwe wrote:

 > The following are passages whose interpretation I was unsure of.

Thank you for the careful reading and catalog of places where even a
technically astute reader may stumble.

The wording proposal at

   http://www.w3.org/XML/Group/2004/06/xmlschema-1/structures.b6009.html
   (member-only link)

shows, in context, the changes I propose to make on the basis of your
comments, but many of your comments may benefit from a more direct
response, so I am also sending you this email response, with a cc to
the XSD comments list.  I'll add a pointer to this email from
Bugzilla but will not include the entire text there.

I've added separator lines and numbers to help the reader navigate and
to help myself keep track, as I draft this response, of how far I have
progressed through your comment and how far I have to go.  This email,
and the current state of the proposal mentioned above, cover only the
first dozen of your two dozen points; a second installment will be
necessary to cover the rest.


---- 1 -------------------------------------------

 > 2.2.1.1 Type Definition Hierarchy

 >    "A type defined with the same constraints as its ·base type
 >    definition·, or with more, is said to be a restriction."

 >    "A complex type definition which allows element or attribute
 >    content in addition to that allowed by another specified type
 >    definition is said to be an extension."

 > I can read these together to say that a single type def may be both
 > an extension and a restriction, although I know XSD syntax does not
 > allow that.  The obvious case is a "vacuous extension", i.e. one
 > that adds no new element or attribute content.  Yes?

Yes.  Note added to say this explicitly.


---- 2 -------------------------------------------

 > 2.2.1.2 Simple Type Definition

 >    "A simple type definition is a set of constraints on strings and
 >    information about the values they encode, applicable to the
 >    ·normalized value· of an attribute information item or of an
 >    element information item with no element children."

 > This appears to say mixed=yes => never a simple type def.  Yes?

It depends on what you mean by mixed content.  In common usage, it
refers to content which is a mixture of parsed character data and
child elements. In that sense, your surmise is correct: if an element
instance contains a mixture of character children and element
children, it cannot be valid against any possible simple type.

More technically, 'mixed content' is often used, in discussions of
SGML and XML DTDs, to refer to content models containing the token
"#PCDATA" -- whether they require or allow the presence of child
elements or not.  (To my surprise, I don't find it defined as a
technical term in ISO 8879, so I'm glossing it here from memory, not
from the spec.)  Certain properties of content models and of parsing
<behavior depend not on the presence of child elements in the instance
but on the presence of '#PCDATA' in the content model.  In that sense,
your surmise would be slightly askew: a DTD content model of the form
'(#PCDATA)*' might well correspond to what in XSD one would declare
as a simple type.

In XSD itself, the term 'mixed content' is used only twice, once
referring to DTDs with what I take to be the sense just given and once
generically to the possibility that the children of an element (or a
type) might be a mixture of characters (other than whitespace) and
children.  More generally, the 'mixed' attribute on source
declarations for complex type definitions corresponds to a particular
value of the corresponding component's {content type}.{variety}.  In
this context, depending on what exactly your surmise is taken to mean,
it may be taken as (a) a category error, (b) a rough approximation but
not completely correct, or (c) a simple and true statement.

(a) Category error: 'mixed' is a property of complex types (or, since
I'm being pedantic: 'mixed' is a possible value of a property of the
{content type} of a complex type definition).  Simple types have no
corresponding or analogous property, so one cannot say "a simple type
has mixed=no" any more than one can say "the transmission of a simple
type is automatic, not manual".

mixed=yes => not a simple type definition, true.  But the same is true
for mixed=no: mixed=no => not a simple type definition, since mixed
does not apply at all to simple type definitions.

(b) Rough approximation: where character data appears we may ask "are
we dealing here with a simple type or no?"  If we are in a context
where child elements are also possible in principle, then we are not
dealing with a simple type.  True enough.

But note that from a formal point of view, xs:anyType has mixed
content: it allows both child elements and character data.  And
xs:anySimpleType -- which is a simple type -- is a restriction of
xs:anyType.  Restriction never adds something that was not already
present, at least notionally, so the formal story requires us to say
that in some sense all the values and lexical representations
associated with simple types are present in xs:anyType (even if for
pragmatic reasons processors are not required to identify them to
downstream applications).  And in that sense I would be reluctant to
affirm that mixed=yes => not a simple type def.

(c) Simple truth: Complex types may have empty content, simple
content, element-only content, or mixed content.  A complex type with
simple content has an instance of a simple type as its content
(i.e. the character sequence found in the input document is a legal
lexical representation of a simple type, and maps to a value of that
type).  For some complex type T, if T.{content type}.{variety} = mixed
then T.{content type}.{variety} != simple.  If that's what "mixed=yes
=> never a simple type def" means, then the answer is "yes".

Given the complexity of the situation I do not know of a way to
address your comment in the spec text without ripping out section 2
and starting over.  That might be useful in making the text easier to
understand, but it would probably delay us by more than a day or two
and prevent XSD 1.1 from ever becoming a W3C Recommendation, so I am
loath to undertake the effort.

If there is a simple change to make here that would have made this
paragraph seem less confusing, I'll be happy to make it, but I have
not yet found one.


---- 3 -------------------------------------------

 > 2.2.2.1 Element Declaration

 >    "...by triggering identity-constraint definition ·validation·."

 > My brain thinks you are calling out 'i-c def validation' as a
 > special term, but the usual presentation evidence of that (dots on
 > either side of a link) is absent.

We don't currently define 'identity-constraint definition validation'
as a term; to try to set your brain at rest I have added a cross
reference to the section on identity-constraint definitions.


---- 4 -------------------------------------------

 > 2.2.2.2 Element Substitution Group

 >    "...name and content of an element must correspond exactly to
 >    the element type referenced in the corresponding content model."

 > Seems to a novice reader equivalent to saying "to the governing type
 > decl".  If so, using that term _might_ be clearer even though it's a
 > forward reference. Alternatively, their equivalence could be noted if
 > it is in fact true.

Actually, this sentence is referring to XML DTDs, and is using the
term 'element type' in a way familiar to DTD-oriented people, but
perhaps less to others.  I've revised it to read:

     When XML vocabularies are defined using the document type
     definition syntax defined by [XML 1.1], a reference in a content
     model to a particular name is satisfied only by an element in the
     XML document whose name and content correspond exactly to those
     given in the corresponding element type definition.

         Note: The "element type" of [XML 1.1] is not quite the same as
         the ·governing type definition· as defined in this
         specification: [XML 1.1] does not distinguish between element
         declarations and types as distinct kinds of object in the way
         that this specification does; the "element type declaration"
         of [XML 1.1] specifies both the kinds of properties associated
         in this specification with element declarations and the kinds
         of properties associated here with (complex) type definitions.


---- 5 -------------------------------------------

 > 2.2.2.2 Element Substitution Group

 > "...Through the new mechanism of element substitution groups, "

 > New?  It was in 1.0.  I realize via further reading it has changed
 > (multi-head now allowed) but that seems like "improved" not "new".
 > If the attempt was to distinguish it from "substitution groups",
 > sans "element", I don't think it does so.

It seems to be hard for the spec to realize that the language it
defines is no longer the new kid on the block.  The word 'new' was
true when it was written, as part of the text of 1.0.  I've deleted it
now.


---- 6 -------------------------------------------

 > 2.2.4.2 Type Alternative

 > "A type-alternative component (type alternative for short)
 > associates..."  The parenthetical seems to be here only for this
 > component type.  Seems like it should be done consistently (all or
 > none).

I think it's motivated by the thought that a 'definition' or a
'declaration' is more clearly and obviously part of a schema than is an
'alternative'.  When we use the phrase 'type definition' instead of
'type definition component', few people outside the paper industry and
the occasional very careful logician are disappointed or confused.  The
same did not seem to us to be true when we introduced the type
alternative component and the phrase 'type alternative' to refer to
such components.  Hence the careful explanation.

Of course, the language sense of the XML Schema WG is affected by our
long involvement with the material.  I doubt that we are wrong in
thinking that we need to explain that 'type alternative' is just short
for 'type alternative component'.  But are we perhaps wrong in
thinking 'type definition' is not clear to a fresh reader as shorthand
for 'type definition component'?  If you tell me we are, I'll happily
insert similar parentheticals throughout section 2.  (Well, not
happily.  But I won't complain where you can hear me.)  But I won't
take the time solely for the sake of a consistency whose value does
not seem obvious to me.


---- 7 -------------------------------------------

 > 3.3.2.1 Common Mapping Rules for Element Declarations - XML Mapping
 > Summary clause 2

 >    "2 otherwise (the <alternative> has a test) a Type Alternative
 >    with the following properties: Property {test} Value ·absent·."

 > <alternative> HAS a test, {test} value is ABSENT.  ???

I've recast the rule in an attempt to make clearer what is going on
here.  The schema author can specify the {default type definition} of
a type table in either of two ways: if the sequence of <alternative>
elements ends in an <alternative> without a 'test' attribute, that
last 'alternative' is taken as specifying the {default type
definition}: it is as if the default test were "1 eq 1".  If the final
<alternative> does have a 'test' attribute, it's taken to be a normal
alternative like the others and handled by the rule for {alternative}
immediately above the passage quoted.  In that case, the element
declaration's declared type is used as the {default type definition}.

The wording quoted is correct, even if your puzzlement is
understandable.  If the final <alternative> element has no test, then
the {default type definition} is constructed from it; otherwise the
{default type definition} has nothing to do with the final
<alternative> and is constructed with an absent {test}.  The 'test'
attribute in the final <alternative> is not lost or ignored -- it
turns up as the {test} property in the last of the {alternatives}.

The rule now reads:

     {default type definition}

         Depends upon the final <alternative> element among the
         [children]. If it has no test [attribute], the final
         <alternative> maps to the {default type definition}; if it
         does have a test attribute, it is covered by the rule for
         {alternatives} and the {default type definition} is taken from
         the declared type of the Element Declaration. So the value of
         the {default type definition} is given by the appropriate
         case among the following:

         1 If the <alternative> has no test [attribute], then a Type
           Alternative corresponding to the <alternative>.

         2 otherwise (the <alternative> has a test) a Type
           Alternative with the following properties:

             Property                 Value
             {test}                   .absent.
             {type definition}        the {type definition} property
                                      of the parent Element
                                      Declaration.
             {annotation}             the empty sequence.

The only change is the insertion of the explanatory sentence "If it
has no ..."
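
As an illustration only, the two cases of the recast rule can be
sketched in Python.  The names Alternative, TypeAlternative, and
map_type_table are hypothetical and appear nowhere in the spec; None
stands in for .absent., type definitions are represented by plain
strings, and the sketch assumes that only the final <alternative> may
lack a test.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Alternative:
    """Hypothetical stand-in for an <alternative> element."""
    type_name: str
    test: Optional[str] = None  # None models a missing 'test' attribute


@dataclass
class TypeAlternative:
    """Hypothetical stand-in for a Type Alternative component."""
    test: Optional[str]  # None models .absent.
    type_definition: str


def map_type_table(
    children: List[Alternative], declared_type: str
) -> Tuple[List[TypeAlternative], TypeAlternative]:
    """Return ({alternatives}, {default type definition}).

    Assumes at least one <alternative> child is present.
    """
    last = children[-1]
    if last.test is None:
        # Case 1: the final <alternative> has no test, so it supplies
        # the default type definition.
        tested = children[:-1]
        default = TypeAlternative(None, last.type_name)
    else:
        # Case 2: the final <alternative> is an ordinary alternative;
        # the default is built from the element's declared type.
        tested = children
        default = TypeAlternative(None, declared_type)
    alternatives = [TypeAlternative(a.test, a.type_name) for a in tested]
    return alternatives, default
```

In both cases the {test} of the default is absent; in the second case
the final test is not lost but shows up on the last entry of the
returned alternatives.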


---- 8 -------------------------------------------

 > 3.3.1 The Element Declaration Schema Component

 > FYI: The two paragraphs beginning with "Element declarations are
 > potential members of the ·substitution groups·," are pretty hard to
 > actually understand (the first more than the second, but the first
 > depends on the second so they are linked).

I've suggested we recast this:

     The {substitution group affiliations} property of an element
     declaration indicates which substitution groups, if any, it can
     potentially be a member of.  Potential membership is transitive
     but not symmetric; an element declaration is a potential member of
     any group named in its {substitution group affiliations}, and
     also of any group of which any entry in its {substitution group
     affiliations} is a potential member. Actual membership may be
     blocked by the effects of {substitution group exclusions} or
     {disallowed substitutions}, see below.
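
A minimal Python sketch of potential membership as described in the
recast paragraph, for illustration only.  The function name
potential_groups and the dictionary representation are assumptions:
each key is the name of an element declaration and each value lists
the heads named in its {substitution group affiliations}.  Blocking
by {substitution group exclusions} or {disallowed substitutions} is
not modeled.

```python
from typing import Dict, List, Set


def potential_groups(name: str, affiliations: Dict[str, List[str]]) -> Set[str]:
    """Heads of all substitution groups `name` is a potential member of."""
    found: Set[str] = set()
    pending = list(affiliations.get(name, []))
    while pending:
        head = pending.pop()
        if head not in found:
            found.add(head)
            # Transitivity: follow the head's own affiliations as well.
            pending.extend(affiliations.get(head, []))
    return found


# Hypothetical declarations: c is affiliated with b, and b with a.
example = {"c": ["b"], "b": ["a"], "a": []}
groups_of_c = potential_groups("c", example)
groups_of_a = potential_groups("a", example)
```

Here groups_of_c contains both "a" and "b" (transitivity), while
groups_of_a is empty (no symmetry).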


---- 9 -------------------------------------------

 > 3.3.4.3 Element Locally Valid (Element)

 > Validation Rule: Element Locally Valid (Element) clause 1

 > When D and E both have namespace values of "absent", clause 1 seems
 > to output "never valid".  Is that the intent, or do I mis-read?

The Namespaces spec says (in the passage linked to by the hyperlink):

     [Definition: An expanded name is a pair consisting of a namespace
     name and a local name. ]

If we allow the namespace name to be absent (as indeed both Namespaces
and XSD do, with the phrases 'have no value' and 'have the value
.absent.', respectively), it seems inescapable at least to me that the
pair (a, .absent.) and the pair (a, .absent.) are identical.

So yes, I think you are misreading this clause.
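
The point can be sketched in Python by modeling an expanded name as
a pair.  The class name ExpandedName is hypothetical, the namespace
URI is a made-up example, and None stands in for an absent namespace
name; the sketch claims nothing about how any processor represents
names.

```python
from typing import NamedTuple, Optional


class ExpandedName(NamedTuple):
    namespace_name: Optional[str]  # None models .absent.
    local_name: str


e = ExpandedName(None, "a")
d = ExpandedName(None, "a")

# Pairs compare component by component, so two names with an absent
# namespace name and the same local name are the same expanded name.
same = e == d

# A name in a namespace is a different pair from one with no namespace.
different = ExpandedName("http://example.com/ns", "a") == d
```

Under this model, same is true and different is false: absence of a
namespace name is just another value of the first component.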

Would it help if the clause read not

     1 D is not ·absent· and E and D have the same expanded name.

but

     1 D is not ·absent· and the expanded names of E and D match.

with 'match' being a hyperlink to the definition of 'match' for
expanded names (in section 3.9.4.1.2 Validation of Basic Terms)?  The
definition says, roughly, that two expanded names match if they are the
same expanded name (and thus, by some lights, not two expanded names
at all).  My instinct is not to change the text, since I think the
current formulation is simpler, but I can be persuaded or outvoted.

In the wording proposal, this change is marked not-status-quo to
distinguish it visually.



--- 10 -------------------------------------------

 > 3.3.5.1 Assessment Outcome (Element)

 > "...with a [schema information] property..."

 > FYI: Since I read this front to back, at this point I had not seen
 > anything to tell me that 1.1 was introducing new properties, so this
 > confused me.  It eventually became clear of course.  I wonder if a
 > link or definition is warranted for new chunks like this.

I'm not sure I understand.  Neither [validation context] nor [schema
information] is a new property introduced by XSD 1.1; both are taken
over without change from 1.0 (except that XSD 1.1 makes explicit that
the [validation root] can be an attribute, which 1.0 passes over in
silence).

The upshot is that I don't know what confused you here and can't
attempt to fix it.



--- 11 -------------------------------------------

 > 3.3.5.2 Validation Failure (Element)

 > FYI: By this point, I figured out that you were defining new PSVI in
 > some of the []'s since I saw the definition before the usage.

 > [schema error code] got me to asking questions about its type
 > (string? qname?)  that I realize now I never asked about the PSVI
 > properties I grew up with, so I'm not sure if those questions are
 > actually fair.  It does seem that there might be some value in
 > making the error codes Qnames, to enable Schema processors invokers
 > to clearly distinguish between "official standard" error codes and
 > additional (potentially more informative) codes provided by the
 > schema processor impl.

 > I have heard folks operating in the business layer complain that
 > standard schema error messages are inadequate generally to tell a
 > user what in the instance is wrong, and therefore they use
 > Schematron etc to pre-process instances and issue more
 > domain-user-friendly messages.

At one point, the XML Schema WG intended to revamp the error codes of
the spec, which seem to some readers to have a number of shortcomings
(different readers, of course, identify different flaws): they don't
actually cover all possible problems; they don't seem to be orthogonal
(failure to satisfy one clause of one constraint may necessarily
entail failing to satisfy a different clause of a different predicate
-- which code should be used? both?); and the idea of ensuring that
error codes can easily be hyperlinked to the relevant rule in the spec
co-exists uneasily with the claim sometimes made that the spec is not
intended to be comprehensible to naive users (only to writers of
schema processors) and the observable fact that (independently of
whether it should be or not) the spec is not written in such a way as
to make it useful to end users seeking to find and fix problems in
their data.

See bug 2843 http://www.w3.org/Bugs/Public/show_bug.cgi?id=2843
See also 2165 http://www.w3.org/Bugs/Public/show_bug.cgi?id=2165

Unfortunately, as our resources and time have grown short, it has
become clear that we do not have the capacity to perform the
front-to-back re-analysis of the spec that would be involved in
defining a new set of error codes.

You are doubtless right that Schematron's ability to customize error
messages helps make it more useful for end users; I believe that the
initial design of the XSD error code system assumed that there would
normally be some layer between the validator and the end user which
could interpret the error code and give the user a useful message.
The fact that there don't seem to be many such layers may suggest that
the current set of error codes is not structured in a way that lends
itself to exploitation by such an intermediate layer.

On the concrete question of the type of schema error code -- like
other parts of the PSVI, the [schema error code] is an abstract label
for some bits of information.  The spec defines no types for any of
them, neither in terms of programming-language types nor in terms of
XSD types or XML elements and attributes.

There was some interest in an API for XSD, but there was also
substantial opposition from some WG members who did not wish to see
W3C standardizing APIs ("We don' need no steeenking APIs" was the way
one WG member put it to me, privately) and some development teams
appear to have concluded that the description of the PSVI itself could
suffice as an API, although it was not designed with that in mind and
interpreting it as an API specification violates the essential premise
of calling it an "information set" rather than an "API" or "document
format".

There has also been some interest in XML representations of the PSVI,
but nothing remotely resembling consensus; several proposals have been
floated, and those who like one proposal generally regard the
alternative proposals as unspeakably ugly, complicated, inadequate, or
baroque, reflecting very badly on the taste or technical acumen of
their designers.  It's not the kind of reaction that encourages an
effort to get all the designers together to seek a meeting of the
minds.


--- 12 -------------------------------------------

 > 3.3.5.2 Validation Failure (Element)

 > "Note: If more than one ... fails to be satisfied," applies equally
 > well to [schema error code], no?

In principle, no.

The PSVI is an abstract account of some of the information generated
during an assessment and with the exception of properties like [failed
assertions] and [failed identity constraints] it is intended to be
invariant, or as nearly so as possible.

So [schema error code] is supposed to contain / is defined as
containing codes for every error in the element or attribute instance
it's attached to.

Some validators will expose only a subset of the PSVI, of course, but
XSD 1.1 attempts to be clear that what happens in such cases is that
the validator is exposing part of an abstract set of information which
is in principle all always present, and not (for example) that the
PSVI varies with the processor's choice of API.  (XSD 1.0 vacillates
between these views unhelpfully, partly because it keeps falling
into the error of confusing information sets with APIs.)

Of course, if you are only going to expose the [validity] and
[validation attempted] properties on the validation root, it can be a
helpful optimization to stop validating as soon as you know what those
values are going to be.  If there are three local validity errors on
the validation root, as well as an invalid descendant, you won't get
them all if you stop on the first error.  But in that case, strictly
speaking, you aren't exposing [schema error code], just the part of it
you calculated (and in principle at least your documentation should
say so).
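
The distinction can be sketched in Python, for illustration only.
The names Check, all_error_codes, and fail_fast_codes are
hypothetical, as are the error codes "E1" to "E3"; a check is modeled
as a function that returns a code on failure and None otherwise.

```python
from typing import Callable, List, Optional

# A hypothetical check returns an error code, or None if satisfied.
Check = Callable[[str], Optional[str]]


def all_error_codes(item: str, checks: List[Check]) -> List[str]:
    """The abstract property: a code for every check that fails."""
    codes = []
    for check in checks:
        code = check(item)
        if code is not None:
            codes.append(code)
    return codes


def fail_fast_codes(item: str, checks: List[Check]) -> List[str]:
    """What a processor that stops at the first error can expose."""
    for check in checks:
        code = check(item)
        if code is not None:
            return [code]
    return []


# Three made-up local checks on a string-valued item.
checks: List[Check] = [
    lambda s: None if s else "E1",
    lambda s: None if s.isdigit() else "E2",
    lambda s: None if len(s) < 3 else "E3",
]
full = all_error_codes("abcd", checks)
partial = fail_fast_codes("abcd", checks)
```

The fail-fast result is always a subset of the full list; on this
view it is a partial exposure of the property, not a different value
of it.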

The [failed assertions] and [failed identity constraints] are defined
differently, on the theory that even in the abstract a knowledge of
all assertions which fail to be true need not be part of the PSVI.

In drafting this response, I have come to believe that this is
unmotivated by any design principle and is merely a relapse into the
same mistaken view of information sets that is visible in parts of XSD
1.0.

So I have proposed to replace the Notes you refer to with different
notes that put the proposition differently.  For assertions:

     [failed assertions]

         A list of Assertions that are not satisfied by the element
         information item, as defined by Assertion Satisfied
         (§3.13.4.1).

             Note: In principle, the value of this property includes
             all of the Assertions which are not satisfied by this
             element item; in practice, some processors will choose not
             to check further assertions after detecting the
             first failure. Such processors will expose a subset of the
             items in this value, rather than the full value.

And analogously for identity constraints.

This is a slightly vexed question (there are reasons that the WG keeps
falling into the mistake of viewing the PSVI as an API).  So while I
will thank you for making me aware of this problem, you should be
aware that others in the WG may not thank you for bringing this topic
to the fore again.




-- 
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com
* http://cmsmcq.com/mib
* http://balisage.net
****************************************************************
Received on Saturday, 11 April 2009 21:47:38 UTC