XML Schema WG comments on Data Model from Sandy Gao on 2004-02-10 (public-qt-comments@w3.org from February 2004)

From: Sandy Gao <sandygao@ca.ibm.com>
Date: Tue, 10 Feb 2004 18:13:01 -0500
To: public-qt-comments@w3.org
Cc: w3c-xml-schema-ig@w3.org
Message-ID: <OFA0FBE4E4.C2E96260-ON85256E36.007E86D1-85256E36.007F8969@ca.ibm.com>
Dear colleagues:

The XML Schema Working Group reviewed the current last call draft of the
Data Model spec, with the following comments. Hope you find them helpful.

Sandy Gao, on behalf of the XML Schema WG


1. Schema-related issues
1.1 Types in data models
1.1.1 Where are they stored?
1.1.2 Light-weight PSVI
1.2 Anonymous type names
1.3 String values of elements and attributes
1.3.1 Lack of consistency
1.3.2 Lack of accuracy
1.4 [validity] = invalid on an ancestor
1.5 Imported schema
1.6 Validate vs. assess
1.7 xsi attributes
1.8 Union of list of union
1.9 Element-content whitespaces
1.10 Atomic values
1.11 Value space of xdt:untypedAtomic

2. Other technical issues
2.1 Accessing unparsed entities
2.2 Text nodes in document node
2.3 Ignored namespace information items
2.4 Order of children in element nodes
2.5 Missing constraints
2.6 "--" in comments
2.7 Errors in the big example

3. Editorial notes
3.1 Values vs. sequences
3.2 Accessors applicable to one node type
3.3 Referring to accessors and property values
3.4 Optional infoset properties
3.5 Other editorial notes


The following comments are from the XML Schema working group on the Last
Call draft of 12 November 2003 of XQuery 1.0 and XPath 2.0 Data Model. [1]
These comments are in addition to our previous comments recorded in [2]. We
remain interested in the status of those comments.

[1] http://www.w3.org/TR/2003/WD-xpath-datamodel-20031112
[2] http://www.w3.org/XML/Group/2003/08/xmlschema-datamodel-comments


1. Schema-related issues

1.1 Types in data models

1.1.1 Where are they stored?

In various places, the draft talks about "types" and properties of these
types. (Examples are in sections 6.2.2 and 6.3.2, for the accessor
"dm:string-value".)

It seems that the word "type" in those places refers to schema type
definitions, rather than just their names. But it's not entirely clear how
such type information is made available. Element/attribute nodes only have
a "type" property for the NAME of the type, but not the type itself.

It's also not clear from the draft how processors get a handle to these
type definitions (schema components). (From a separate schema loader, from
PSVI, etc.)

It "seems" that the intention is:
- Type definitions are available within DM-compliant processors.
- There is also a name-to-type mapping (including anonymous type names)
that's available in such processors.
- Such information is internal to the DM, and is not exposed to
applications that use the DM. (Which explains why there are no accessors to
expose real types.)
- Schemas (or schema components) are somehow "imported" by DM processors.
How they are imported is not defined in DM spec. Other specs or
implementations can have their own ways to implement such importing.

If the above is correct, then there should be some notes to make it clear.

1.1.2 Light-weight PSVI

If the DM is built on top of a light-weight PSVI, then how does the
"name-to-type" mapping work? For anonymous types, all the information
provided by light-weight PSVI is "this type doesn't have a name". Even if
the DM processor somehow "imported" all the type definitions, how does it
know which type definition corresponds to this anonymous type?

We came to the conclusion that processors *might* be able to map an
anonymous type to a type definition in the "imported" schema (it works by
induction):
- If [type definition anonymous] is true for the validation root, we assume
its type definition is already available to the processor.
- Assume the type definition is known for the parent element. If [type
definition anonymous] is true, then it's possible to find an
element/attribute declaration (hence the type definition) for the current
element/attribute in the type definition of the parent element. (EDC makes
it easier, but wildcards make it harder.) (Special processing is needed for
xsi attributes.)
- Assume the type definition is known for the current element/attribute. If
[member type definition anonymous] is true, then the processor can
re-validate the string value using the type definition to find out which
member type is actually used.

The above process is possible, but it's not straightforward:
- Even with EDC, marching through all particles in the parent complex type
is expensive.
- With wildcards, EDC doesn't always give the right answer. (Imagine a
sequence of a local element "ns:e" with an anonymous type followed by a
wildcard. And there is a global "ns:e" with an anonymous type. In the
instance there are 2 "ns:e" elements. What's the type for the second
"ns:e"?)
- Re-validating strings to get member type definitions is also expensive
(and redundant).

So to get the correct answer, a DM processor has to duplicate *a lot* of
the work that has already been done by the schema processor.
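
As a rough illustration of the first two induction steps (not taken from
the draft), the lookup might be sketched as follows. The TypeDef and
Element classes and the resolve_types function are hypothetical stand-ins
for whatever a processor uses internally; wildcards, attributes, and
member types are deliberately left out, which is exactly where the
approach gets hard.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TypeDef:
    """Hypothetical stand-in for a schema type definition."""
    name: Optional[str]  # None for an anonymous type
    # Local element declarations of a complex type: element name -> type
    local_elements: Dict[str, "TypeDef"] = field(default_factory=dict)


@dataclass
class Element:
    """Hypothetical element item carrying only light-weight PSVI."""
    name: str
    type_anonymous: bool              # [type definition anonymous]
    type_name: Optional[str] = None   # [type definition name], if named
    children: List["Element"] = field(default_factory=list)


def resolve_types(elem, elem_type, named_types, path=""):
    """Map each element path to a type definition, working down from the root.

    elem_type is the definition already known for elem (the induction
    hypothesis). Wildcards, attributes, and member types are ignored.
    Same-named siblings share one path key in this sketch.
    """
    here = f"{path}/{elem.name}"
    resolved = {here: elem_type}
    for child in elem.children:
        if child.type_anonymous:
            # Search the parent's type for a matching local declaration.
            child_type = elem_type.local_elements.get(child.name)
            if child_type is None:
                raise LookupError(f"no local declaration for {child.name}")
        else:
            child_type = named_types[child.type_name]
        resolved.update(resolve_types(child, child_type, named_types, here))
    return resolved
```

Even in this simplified form, every element with an anonymous type costs
a search through its parent's type definition.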

We would like clarification on whether the DM spec expects implementations
to work in the way described above if a DM is built on top of a
light-weight PSVI. (Or is there a much easier way that we are missing?)

Some members from the schema WG suggest that maybe DM construction should
only work with heavy-weight PSVI.


1.2 Anonymous type names

In Section 3.3.1

"If the [validity] property exists and is "valid", the type of an element
or attribute information item is represented by an expanded-QName whose
namespace and local name correspond to the first applicable items in the
following list:
* If [member type definition] exists and its {name} property is present:
  - The {target namespace} and {name} properties of the [member type
definition] property.
* If the [type definition] property exists and its {name} property is
present:
  - The {target namespace} and {name} properties of the [type definition]
property.
* If [member type definition anonymous] exists:
  - If it is false: the [member type definition namespace] and the [member
type definition name].
  - Otherwise, the namespace and local name of the appropriate anonymous
type name.
* If [type definition anonymous] exists:
  - If it is false: the [type definition namespace] and the [type
definition name]
  - Otherwise, the namespace and local name of the appropriate anonymous
type name."
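
To make the "first applicable item" reading concrete, here is a minimal
sketch of the quoted list. The dictionary representation of the PSVI
properties, the function name type_name, and the placeholder anonymous
name are all assumptions made for illustration; the draft does not define
any of them.

```python
from typing import Optional, Tuple

# Placeholder for an implementation-defined anonymous type name (assumption).
ANON = ("urn:example:anon", "anon-type")


def type_name(psvi: dict) -> Optional[Tuple[Optional[str], str]]:
    """Return (namespace, local name) using the first applicable rule."""
    if psvi.get("validity") != "valid":
        return None  # the quoted list only covers valid items

    member = psvi.get("member type definition")
    if member is not None and member.get("name") is not None:
        return (member.get("target namespace"), member["name"])

    typedef = psvi.get("type definition")
    if typedef is not None and typedef.get("name") is not None:
        return (typedef.get("target namespace"), typedef["name"])

    if "member type definition anonymous" in psvi:
        if not psvi["member type definition anonymous"]:
            return (psvi.get("member type definition namespace"),
                    psvi.get("member type definition name"))
        return ANON

    if "type definition anonymous" in psvi:
        if not psvi["type definition anonymous"]:
            return (psvi.get("type definition namespace"),
                    psvi.get("type definition name"))
        return ANON

    return None
```

Under this reading, an item whose [member type definition] is present but
has no {name}, and whose [type definition] is a named union, falls through
to the second bullet and gets the union's name rather than an anonymous
type name for the member type.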

It's related to a previous comment [3].

[3] http://www.w3.org/XML/Group/2003/08/xmlschema-datamodel-comments#d0e205

In the above comment, the schema WG suggested that the rules for [type
definition] should be changed to handle anonymous types. On top of that, we
believe that similar changes need to be applied to the rules for [member
type definition].

Proposed fix: consider something similar to what DOM3 Core spec adopted
[4].

[4]
http://www.w3.org/TR/2003/CR-DOM-Level-3-Core-20031107/core.html#TypeInfo


1.3 String values of elements and attributes

In sections 6.2.2 and 6.3.2, for "dm:string-value".

1.3.1 Lack of consistency

There are at least three ways to compute a string value:

- dm:string-value [5] of Element Node
- dm:string-value [6] of Attribute Node
- casting an atomic value to xs:string [7]

[5]
http://www.w3.org/TR/2003/WD-xpath-datamodel-20031112/#ElementNodeAccessors
[6]
http://www.w3.org/TR/2003/WD-xpath-datamodel-20031112/#AttributeNodeAccessors
[7]
http://www.w3.org/TR/2003/WD-xpath-functions-20031112/#casting-to-string

First, multiple casting/conversion rules are unnecessary. It appears that
[5] and [6] are incomplete. [7] is close to accurate, but may have issues;
it is, however, outside our jurisdiction. Our suggestion: there should be
one set of rules, and all three must point to it. Where should that
conversion rule reside?

1.3.2 Lack of accuracy

When the type is not xs:QName or xs:NOTATION, why doesn't this accessor
always return
- the concatenation of the string-values of all the text nodes among its
<b>children</b> for elements, and
- the "string-value" property for attributes?
Are there cases where this approach doesn't work?

It seems [5] and [6] are trying to recover the original string in the
instance document, but isn't it already available in the text nodes and
string-value property?

Assume the above approach doesn't work, which means we have to cast atomic
values to string, then there are further comments.

- For "simple type or complex type with simple content" cases, don't we
need to consider derived types? That is, for example, instead of saying "If
the element type is xs:anyURI", we would say "If the element type is or is
derived from xs:anyURI". Currently, only types derived from "xs:string" are
considered.
- The phrases "the string", "the URI", and "the value" are used in [5] and
[6], but it's not clear what they refer to. Some atomic value available
somewhere?
- In the case of xs:anyURI: "... returns the characters of the URI".
Editorial comment: shouldn't it be "... returns a string formed from the
characters ..."?
- In the case of xs:QName
  - Shouldn't xs:NOTATION be considered in the same way as xs:QName?
  - "If the value has a namespace URI, then there must be at least one
prefix mapped to that URI in the in-scope namespaces. If there is no such
prefix, an error is raised ("no prefix defined for namespace")." Is it
still an error if the default namespace is mapped to that URI?
  - "If no error occurs, returns a string with the lexical form of a
xs:QName using the prefix chosen as described above, and the local name of
the value." But in the case where the value has no namespace URI, then
there is no prefix "chosen as described above".
- It seems that [6] doesn't consider types other than those listed. It
needs a new bullet with something like "In all other cases, ..."
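
The two open questions in the xs:QName case show up as soon as one tries
to write the rule down. The sketch below is one possible reading, not the
draft's algorithm: the function name, the use of "" as the key for the
default namespace, the decision not to count the default namespace as a
prefix, and the choice of the alphabetically first prefix are all
assumptions.

```python
from typing import Dict, Optional


def qname_string_value(namespace_uri: Optional[str], local_name: str,
                       in_scope: Dict[str, str]) -> str:
    """Build a lexical QName from a value and the in-scope namespaces.

    in_scope maps prefix -> namespace URI; the default namespace, if any,
    is stored under the empty-string key.
    """
    if namespace_uri is None:
        # No prefix is "chosen as described above"; the draft's wording
        # does not cover this case, so return the bare local name.
        return local_name

    # Assumption: a default-namespace binding does not count as a prefix.
    prefixes = sorted(p for p, uri in in_scope.items()
                      if uri == namespace_uri and p != "")
    if not prefixes:
        raise ValueError("no prefix defined for namespace")
    return f"{prefixes[0]}:{local_name}"
```

With this reading, a value in "urn:a" raises the error when only the
default namespace is bound to "urn:a". Whether that is the intended
behaviour is the question raised above.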


1.4 [validity] = invalid on an ancestor

In section 6.2.4, for "type":

"* If the [validity] property exists and is "valid" on this element and all
of its ancestors, type is assigned as described in 3.3.1 Mapping PSVI
Additions to Types
* Otherwise, xdt:untypedAny."

And in section 6.3.4, for "type":

"* If the [validity] property does not exist on this node or any of its
ancestors, Infoset processing applies.
  ...
* If the [validity] property exists and is "valid", type is assigned as
described in 3.3.1 Mapping PSVI Additions to Types
* Otherwise, xdt:untypedAtomic."

It seems that the rules for elements and attributes are different, in the
following case:
- [validity] exists on the current element/attribute and all its ancestors,
and
- [validity] is "invalid" on one of the ancestors, and
- [validity] is "valid" on the current element/attribute.

In this case, for elements, the type is "xdt:untypedAny"; but for
attributes, the type in PSVI is used. Is there a reason for such a
difference?
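
The difference can be seen by writing both rules side by side. This is
only a sketch of our reading of the two quoted passages: the function
names, the list representation of [validity] values (node first, then
ancestors, with None meaning the property is absent), and the INFOSET
marker standing in for the elided Infoset rules are assumptions.

```python
from typing import List, Optional

# Stand-in for the rules elided ("...") in the quoted text.
INFOSET = "(Infoset processing applies)"


def element_type(validity_chain: List[Optional[str]], psvi_type: str) -> str:
    """Section 6.2.4 reading: valid on the element and all ancestors."""
    if all(v == "valid" for v in validity_chain):
        return psvi_type
    return "xdt:untypedAny"


def attribute_type(validity_chain: List[Optional[str]], psvi_type: str) -> str:
    """Section 6.3.4 reading: only the attribute's own [validity] counts."""
    if all(v is None for v in validity_chain):
        return INFOSET
    if validity_chain[0] == "valid":
        return psvi_type
    return "xdt:untypedAtomic"
```

For a chain such as ["valid", "invalid", "valid"], element_type yields
xdt:untypedAny while attribute_type yields the PSVI type.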


1.5 Imported schema

In section 2.5,

"The data model uses expanded-QNames to represent the names of named types,
which includes both the built-in types defined by [Schema Part 2] and named
user-defined types declared in a schema and imported by a stylesheet or
query."

"[Definition: An anonymous type name is an implementation defined, unique
type name provided by the processor for every anonymous type declared in an
imported schema.] "

- Does the word "import" here have the same meaning as used/defined in
section 4.2.3 of the schema structure spec?
- If not, its meaning should be made clear. If yes, why do we only care
about imported schemas, but not the "importing" schema?
- The first sentence indicates that types are imported; the second
indicates that a schema is imported. Is there any difference?

Proposed fix: if there is no particular reason to include the word
"imported" here, we suggest removing it. The first sentence would then read
"... declared in a schema." and the second "... declared in a schema.]"


1.6 Validate vs. assess

In section 3.3

"An instance of the data model can be constructed from a PSVI that has been
strictly, laxly, or skip validated or validated using any combination
assessment modes."

- PSVI is an augmentation to the base infoset, so I don't believe it's
VALIDATED. (Probably GENERATED.)
- Do the phrases "strictly/laxly/skip validated" have special meanings that
are not defined in the schema spec? Are they different from
"strictly/laxly/not assessed" from the schema spec?

Proposed fix: if there is no particular reason to introduce such a
difference, and more importantly, to align with the schema spec, we suggest
something along the following lines:

"An instance of the data model can be constructed from a PSVI, whose
element and attribute information items have been strictly assessed, laxly
assessed, or have not been assessed."


1.7 xsi attributes

In section 6.3.4

"They will be validated appropriately by schema processors and will simply
appear as attributes of type xs:anySimpleType if they haven't been schema
validated."

- Editorial note: "They will be validated ..." and "if they haven't been
schema validated" sound contradictory.
- These attributes are just ordinary attributes in schema, and they have
their declarations and types. Proper PSVI are available for them, so there
is no need to treat them differently. And of course, "xs:anySimpleType"
doesn't apply.

Proposed fix: remove the above sentence.


1.8 Union of list of union

In section 7

"The value of a node whose type is a union type is represented by the
appropriate value for the appropriate member type of the union type. If the
member type is an atomic type, the value is represented as an atomic value
of that type. If the member type is a list type, the value is represented
as a sequence of atomic values whose type is the item type of the list
type. The union type information is lost and only the specific type of the
actual item is retained."

In the case where the member type is a list type, it's possible that the
item type of that list is another union type.

Proposed fix: replace the whole list in section 7 with something like the
following:

"The value of a node is determined in the following way based on its type.
The rules apply recursively.

* If the type's [variety] is atomic, then the value is represented as an
atomic value of that type.
* If the type's [variety] is union, then the value is represented by the
appropriate value for the appropriate member type of that type. Note that
the member type's [variety] may be atomic or list. The union type
information is lost and only the specific type of the actual item is
retained.
* If the type's [variety] is list, then the value is represented by a
sequence of appropriate values based on the item type of that type. Note
that the item type's [variety] may be atomic or union."
"If the member type is a list type, the value is represented as a sequence
of atomic values whose type is the item type of the list type."
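
The recursive rules proposed above can be sketched in a few lines. The
SimpleType class, the typed_value function, and the two toy parsers are
hypothetical; real lexical-to-value mappings are much stricter than
Python's int(). The point is only that a union of a list of a union
resolves without any special casing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class SimpleType:
    """Hypothetical simple type definition."""
    name: str
    variety: str                              # "atomic", "list" or "union"
    parse: Optional[Callable] = None          # atomic: lexical -> value
    item_type: Optional["SimpleType"] = None  # list: the item type
    member_types: Tuple["SimpleType", ...] = ()  # union: members, in order


def typed_value(lexical: str, t: SimpleType) -> List[tuple]:
    """Return (value, atomic type name) pairs; the rules apply recursively."""
    if t.variety == "atomic":
        return [(t.parse(lexical), t.name)]
    if t.variety == "list":
        result = []
        for token in lexical.split():
            result.extend(typed_value(token, t.item_type))
        return result
    if t.variety == "union":
        # The first member type that accepts the lexical form wins;
        # the union type itself is not recorded in the result.
        for member in t.member_types:
            try:
                return typed_value(lexical, member)
            except ValueError:
                continue
        raise ValueError(f"{lexical!r} matches no member of {t.name}")
    raise ValueError(f"unknown variety {t.variety!r}")


def parse_boolean(s: str) -> bool:
    if s not in ("true", "false"):
        raise ValueError(s)
    return s == "true"


integer = SimpleType("xs:integer", "atomic", parse=int)
boolean = SimpleType("xs:boolean", "atomic", parse=parse_boolean)
int_or_bool = SimpleType("intOrBool", "union", member_types=(integer, boolean))
item_list = SimpleType("itemList", "list", item_type=int_or_bool)
outer = SimpleType("outer", "union", member_types=(item_list,))
```

Calling typed_value("1 true 2", outer) yields three atomic values typed
xs:integer, xs:boolean, and xs:integer.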


1.9 Element-content whitespaces

In section 6.7.3, for "content"

"Applications may construct text nodes in the data model to represent
insignificant white space."

In section 6.7.4, for "content"

"Construction from a PSVI is identical to construction from the Infoset."

Because construction from PSVI is identical to Infoset, processors are
still allowed to not include certain whitespaces in the text nodes.
- But there is always a possibility that DTD and XML Schema don't agree on
whether certain whitespaces are ignorable.
- It's even possible that if certain whitespaces are ignored after DTD
processing, the instance becomes schema-invalid. (Schema operates on all
characters, not only those that are not ignorable whitespaces.)

As an example of the second point, consider

instance.xml

<!DOCTYPE E [
<!ELEMENT E (C?)>
]>
<E>   </E>

schema.xsd

<schema xmlns="http://www.w3.org/2001/XMLSchema">
  <element name="E" type="string" fixed="   "/>
</schema>

In the above example, the internal DTD declares an element E with
element-only content with an optional C element as its child. And in the
instance, the optional C doesn't appear in the content of E, and E only has
3 space characters as its children. And we have a schema, which declares
element E to have the "xs:string" type with a fixed value of 3 spaces.
Clearly, the instance is valid with respect to the schema.

Now we construct a data model from the PSVI. If we ignore those
element-content whitespaces (as allowed by 6.7.3), then we'll get an
element node for E without any children. The data model then no longer
seems to be valid with respect to the schema, because of the fixed value
on E.

Proposed fix: when a data model is constructed from PSVI, then all
whitespaces have to be included.
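
The effect can be reproduced in miniature. The sketch below does not run
a DTD or schema processor; it only compares the string value of E against
the fixed value from schema.xsd, with and without whitespace-only text.
The string_value helper and its flag are hypothetical.

```python
import xml.etree.ElementTree as ET

FIXED = "   "  # the fixed value declared for E in schema.xsd


def string_value(elem: ET.Element, keep_whitespace_only_text: bool) -> str:
    """String value of a leaf element, optionally dropping
    whitespace-only text as 6.7.3 allows."""
    text = elem.text or ""
    if not keep_whitespace_only_text and text.strip() == "":
        return ""
    return text


e = ET.fromstring("<E>   </E>")
matches_with_whitespace = string_value(e, True) == FIXED
matches_without_whitespace = string_value(e, False) == FIXED
```

Only the variant that keeps the three spaces still agrees with the fixed
value.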


1.10 Atomic values

In section 7

"[Definition: An atomic value is a value in the value space of an atomic
type labeled with that atomic type.]"

What's the meaning of "labeled with"? It feels like DM atomic values are
composed values, each of which has 2 properties: the schema atomic value
and the schema type, and a DM atomic value is certainly different from a
schema atomic value.

"The value space of the atomic values is the union of the value spaces of
the atomic types. This value space clearly includes those atomic values
whose type is primitive, but it also includes those whose type is derived
by restriction, as derivation by restriction always limits the value
space."

Because members of "the value spaces of the atomic types" are schema
values, members of "the value space of the atomic values" are also schema
values. This would imply that DM atomic values (composed values) are not in
the value space of the atomic values (schema atomic values). This is
counter-intuitive.

Proposed fix: we think that the intention is for DM atomic values to be the
same as schema atomic values, and it's always possible to find out the type
that was used to validate/generate a certain DM atomic value. If this is
true, then the "labeled with" phrase in the definition is a bit
misleading. Maybe "labeled with that atomic type" should be removed from
the definition, and a following note/description makes it clear that
whenever an atomic value is stored in the DM, there is always a way to
discover its corresponding type that was used to validate/generate the
value.


1.11 Value space of xdt:untypedAtomic

Section 7.1.2 has a short description of xdt:untypedAtomic, but it
doesn't answer the following questions:
- What is the value space of xdt:untypedAtomic?
- Does it overlap with the value spaces of schema primitive types?
- Does it contribute to "the value space of the atomic values" (from
section 7)?

Without answers to these, it's difficult to understand the meaning of
something like (from section 6.1.2, for "dm:typed-value")

"Returns dm:string-value of the node as an xdt:untypedAtomic value."

An immediate question is how to return an "xs:string" AS an
"xdt:untypedAtomic" when they don't have a sub-type relation, and when
xdt:untypedAtomic doesn't have a value space.

Proposed fix: we understand that this type is magic, and it shouldn't be
considered as an ordinary type (only used as an indicator). So to avoid the
confusion, the spec needs to make it clear that it's magic, and its value
space is not known, but this type can be used to "label" atomic values.


2. Other technical issues

2.1 Accessing unparsed entities

In section 6.1

Document nodes have an [unparsed-entities] property, which is a sequence of
entities. They also have accessors "dm:unparsed-entity-system-id" and
"dm:unparsed-entity-public-id". So there are ways to query the
system/public id of a given entity. Wouldn't it be useful to have an
accessor that exposes all entities within a document node?

We propose introducing:

dm:unparsed-entities($node as document()) as xs:string*

and optionally:

dm:unparsed-entity-system-ids($node as document()) as xs:string*
dm:unparsed-entity-public-ids($node as document()) as xs:string*

(If the above 2 are introduced, then the processor must guarantee the order
of the entities is the same for all 3 accessors.)
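
A minimal sketch of how the three proposed accessors could share one
ordering. The UnparsedEntity and DocumentNode classes are hypothetical,
and returning an empty string for a missing public identifier is an
assumption made to keep the three sequences the same length.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class UnparsedEntity:
    name: str
    system_id: str
    public_id: Optional[str] = None


@dataclass
class DocumentNode:
    # Hypothetical storage for the [unparsed-entities] property.
    unparsed: List[UnparsedEntity] = field(default_factory=list)


def unparsed_entities(doc: DocumentNode) -> List[str]:
    return [e.name for e in doc.unparsed]


def unparsed_entity_system_ids(doc: DocumentNode) -> List[str]:
    return [e.system_id for e in doc.unparsed]


def unparsed_entity_public_ids(doc: DocumentNode) -> List[str]:
    # Assumption: "" marks an entity without a public identifier.
    return [e.public_id or "" for e in doc.unparsed]
```

Because all three functions iterate over the same stored sequence, the
i-th item of each result describes the same entity.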


2.2 Text nodes in document node

In section 6.1.3, for "children" property

"For each element, processing instruction, comment, and maximal sequence of
adjacent character information items found in the [children] property, a
corresponding element, processing instruction, comment, or text node is
constructed and that sequence of nodes is used as the value of the children
property."

But character information items are not supposed to appear in the
[children] property of a document information item.

Proposed fix:
- change "comment, and maximal sequence of adjacent character information
items" to "and comment", and
- change "comment, or text node" to "and comment".


2.3 Ignored namespace information items

In section 6.2.3, for "namespaces"

"Implementations may ignore namespace information items for namespaces
which do not appear in the expanded QName of any element or attribute
information item."

- It's not clear what "any (element or attribute ...)" means. Any
element/attribute in the whole instance document? Any descendant
elements/attributes?
- It's also possible that the value of some element/attribute is of type
QName/NOTATION, and we need to retain the namespace information for the
namespace URI in such value.

Proposed fix:

Implementations may ignore namespace information items for namespaces which
- do not appear in the expanded QName of the current element information
item or any of its descendant element or attribute information items, and
- do not appear in the value of the current element information item or any
of its descendant element or attribute information items, if the type of
such value is xs:QName or xs:NOTATION.
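
The proposed rule amounts to collecting every namespace URI that is still
needed below the current element and treating the rest as ignorable. A
sketch, with hypothetical Attr and Elem classes and with qname_value
standing for a typed value of type xs:QName or xs:NOTATION:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

QName = Tuple[Optional[str], str]  # (namespace URI, local name)


@dataclass
class Attr:
    name: QName
    qname_value: Optional[QName] = None  # set for xs:QName / xs:NOTATION


@dataclass
class Elem:
    name: QName
    attributes: List[Attr] = field(default_factory=list)
    children: List["Elem"] = field(default_factory=list)
    qname_value: Optional[QName] = None


def needed_namespaces(elem: Elem) -> Set[str]:
    """URIs used in names or QName-typed values at or below elem."""
    needed: Set[str] = set()
    for item in [elem, *elem.attributes]:
        for qname in (item.name, item.qname_value):
            if qname is not None and qname[0] is not None:
                needed.add(qname[0])
    for child in elem.children:
        needed |= needed_namespaces(child)
    return needed


def ignorable_prefixes(elem: Elem, in_scope: Dict[str, str]) -> Set[str]:
    """Prefixes whose namespace information items may be ignored."""
    needed = needed_namespaces(elem)
    return {p for p, uri in in_scope.items() if uri not in needed}
```

A namespace that appears only inside a QName-typed attribute value is
kept, which is the case the current wording of the draft would lose.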


2.4 Order of children in element nodes

In section 6.2.4, for "children"

"The order of these nodes is implementation defined."

So even if a PI appeared before another PI in the infoset, they are allowed
to appear in the reverse order in the data model? This is
counter-intuitive.


2.5 Missing constraints

[Section 6.3.1]

"2. If a attribute node A has a parent element E, then A must be among the
attributes of E."

In the previous section, the rules go both ways. To be consistent, we need
to insert a new rule before this one.

"2. If an attribute node A is among the <b>attributes</b> of an element E,
then the <b>parent</b> of A <b>must</b> be E."

[Section 6.4.1]

"2. If a namespace node N has a parent element E, then N must be among the
namespaces of E."

Same comments as above.
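
To show why both directions are needed, here is a small consistency check
for attribute nodes. The node classes and the link_violations function
are hypothetical; the same pattern would apply to namespace nodes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ElementNode:
    name: str
    attributes: List["AttributeNode"] = field(default_factory=list)


@dataclass(eq=False)
class AttributeNode:
    name: str
    parent: Optional[ElementNode] = None


def link_violations(elements: List[ElementNode],
                    attributes: List[AttributeNode]) -> List[str]:
    problems = []
    # Proposed rule: if A is among the attributes of E, the parent of A is E.
    for e in elements:
        for a in e.attributes:
            if a.parent is not e:
                problems.append(f"{a.name}: listed on {e.name}, wrong parent")
    # Existing rule: if A has parent E, A is among the attributes of E.
    for a in attributes:
        if a.parent is not None and not any(x is a for x in a.parent.attributes):
            problems.append(f"{a.name}: parent {a.parent.name} does not list it")
    return problems
```

With only the existing rule, an attribute node that is listed on E but
has no parent at all would pass unnoticed.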

[Section 6.7.1]

The first rule about unique identity seems to be missing. Is this intended?


2.6 "--" in comments

In section 6.6.1

"2. The string "--" must not occur within the content."

But it's possible to have "--" in the content of comments using
entity/character references.

Proposed fix: remove this constraint.


2.7 Errors in the big example

In appendix D, I only took a quick look at the first few nodes, and spotted
some problems.

- For PI node P1, "dm:typed-value" is missing.
- For element node E1, why dm:type(E1) = xs:anyType? Shouldn't an anonymous
type name be generated?
- For comment node C1, "dm:typed-value" should have a value.
- Also for comment node C1, "dm:type" is missing.

My guess is that there are more problems in the nodes that follow. Was this
generated by some program or written by hand?


3. Editorial notes

3.1 Values vs. sequences

There are places in the DM draft where the relation between values and
sequences is not clear enough.

[Section 1]

"Every value in the data model is a sequence of zero or more items."
"An atomic value encapsulates an XML Schema atomic type and a corresponding
value of that type."
"A sequence cannot be a member of a sequence."

The combination of the above sentences is at worst contradictory, and at
best misleading. An atomic value is a value; a value is a sequence; a
sequence can't contain another sequence. The result is that an atomic value
can't be a member of a sequence, which clearly isn't the intention.

I don't have a better wording for the first sentence. But I think its
intention is clear from other parts of the spec, and it wouldn't be a
disaster to simply drop the first sentence.

[Section 2.2]

Related to a previous comment [8].

[8] http://www.w3.org/XML/Group/2003/08/xmlschema-datamodel-comments#d0e325

"Some accessors can accept or return sequences."

Aren't all values sequences? So "can" isn't accurate. It gives the
impression that some other accessors "can" accept or return things other
than sequences.

Suggested fix: drop the word "can".

[Section 2.2]

"The following notation is used to denote sequence values:
* V* denotes a sequence of zero or more items of type V.
* V? denotes a sequence of exactly zero or one items of type V.
* V+ denotes a sequence of one or more items of type V."

It seems to me "only" the above 3 forms are used to denote sequence values.
But from my understanding of the spec, "V" also denotes a sequence (of
length 1).

Suggested fix: add the following as the first item:

"* V  denotes a sequence of one item of type V."

[Section 3]

"The data model also supports values that are not nodes. Examples of these
are atomic values, sequences of atomic values, or sequences mixing nodes
and atomic values."

Since values in the data model are all sequences, it's not appropriate to
list "atomic values" as an example.


3.2 Accessors applicable to one node type

In section 5.11

"It is defined on all seven node types, but always returns the empty
sequence for all nodes except elements."

(Where "it" refers to the "nilled" accessor.)

- The same statement is true for another 2 accessors, "attributes" and
"namespaces": they only have values for elements. Why aren't there similar
notes for those 2 accessors?
- These 3 accessors only apply to elements. Then why have them here
(defined on all seven node types), instead of just having them on
element nodes?


3.3 Referring to accessors and property values

There are occurrences of phrases like "the parent of N", "among the
children of N", "the string-value of N", etc., where it's not clear whether
"parent/children/string-value" refer to a certain property of a node, or
the returned value of a certain accessor.

Either way, it needs to be clarified. If the intention is to refer to
property values, then they should be in bold font; if the intention is to
refer to accessors, then such a convention should be made clear somewhere.

Received on Tuesday, 10 February 2004 18:44:36 UTC