- From: Sandy Gao <sandygao@ca.ibm.com>
- Date: Tue, 10 Feb 2004 18:13:01 -0500
- To: public-qt-comments@w3.org
- Cc: w3c-xml-schema-ig@w3.org
Dear colleagues:

The XML Schema Working Group reviewed the current last call draft of the Data Model spec, with the following comments. Hope you find them helpful.

Sandy Gao, on behalf of the XML Schema WG

1. Schema-related issues
   1.1 Types in data models
       1.1.1 Where are they stored?
       1.1.2 Light-weight PSVI
   1.2 Anonymous type names
   1.3 String values of elements and attributes
       1.3.1 Lack of consistency
       1.3.2 Lack of accuracy
   1.4 [validity] = invalid on an ancestor
   1.5 Imported schema
   1.6 Validate vs. assess
   1.7 xsi attributes
   1.8 Union of list of union
   1.9 Element-content whitespaces
   1.10 Atomic values
   1.11 Value space of xdt:untypedAtomic
2. Other technical issues
   2.1 Accessing unparsed entities
   2.2 Text nodes in document node
   2.3 Ignored namespace information items
   2.4 Order of children in element nodes
   2.5 Missing constraints
   2.6 "--" in comments
   2.7 Errors in the big example
3. Editorial notes
   3.1 Values vs. sequences
   3.2 Accessors applicable to one node type
   3.3 Referring to accessors and property values
   3.4 Optional infoset properties
   3.5 Other editorial notes

The following comments are from the XML Schema working group on the Last Call draft of 12 November 2003 of XQuery 1.0 and XPath 2.0 Data Model. [1] These comments are in addition to our previous comments recorded in [2]. We remain interested in the status of those comments.

[1] http://www.w3.org/TR/2003/WD-xpath-datamodel-20031112
[2] http://www.w3.org/XML/Group/2003/08/xmlschema-datamodel-comments

1. Schema-related issues

1.1 Types in data models

1.1.1 Where are they stored?

In various places, the draft talks about "types" and properties of these types. (Examples are in sections 6.2.2 and 6.3.2, for accessor "dm:string-value".) It seems that the word "type" in those places refers to schema type definitions, instead of just their "name"s. But it's not entirely clear how such type information is available.
Element/attribute nodes only have a "type" property for the NAME of the type, but not the type itself. It's also not clear from the draft how processors get a handle to these type definitions (schema components): from a separate schema loader, from the PSVI, etc.

It "seems" that the intention is:

- Type definitions are available within DM-compliant processors.
- There is also a name-to-type mapping (including anonymous type names) that's available in such processors.
- Such information is internal to the DM, and is not exposed to applications that use the DM. (Which explains why there are no accessors to expose real types.)
- Schemas (or schema components) are somehow "imported" by DM processors. How they are imported is not defined in the DM spec. Other specs or implementations can have their own ways to implement such importing.

If the above is correct, then there should be some notes to make it clear.

1.1.2 Light-weight PSVI

If the DM is built on top of a light-weight PSVI, then how does the "name-to-type" mapping work? For anonymous types, all the information provided by a light-weight PSVI is "this type doesn't have a name". Even if the DM processor somehow "imported" all the type definitions, how does it know which type definition corresponds to this anonymous type?

We came to the conclusion that processors *might* be able to map an anonymous type to a type definition in the "imported" schema (it works by induction):

- If [type definition anonymous] is true for the validation root, we assume its type definition is already available to the processor.
- Assume the type definition is known for the parent element. If [type definition anonymous] is true, then it's possible to find an element/attribute declaration (hence the type definition) for the current element/attribute in the type definition of the parent element. (EDC makes it easier, but wildcards make it harder.) (Special processing is needed for xsi attributes.)
- Assume the type definition is known for the current element/attribute. If [member type definition anonymous] is true, then the processor can re-validate the string value using the type definition to find out which member type is actually used. The above process is possible, but it's not straightforward: - Even with EDC, marching through all particles in the parent complex type is expensive. - With wildcards, EDC doesn't always give the right answer. (Imagine a sequence of a local element "ns:e" with an anonymous type followed by a wildcard. And there is a global "ns:e" with an anonymous type. In the instance there are 2 "ns:e" elements. What's the type for the second "ns:e"?) - Re-validating strings to get member type definitions is also expensive (and redundant). So to get the correct answer, a DM processor has to duplicate *a lot* of the work that has already been done by the schema processor. We want to get a clarification about whether the DM spec does expect implementations to work in the above described way if a DM is built on top of a light-weight PSVI. (Or there is a much easier way that we are missing.) Some members from the schema WG suggest that maybe DM construction should only work with heavy-weight PSVI. 1.2 Anonymous type names In Section 3.3.1 "If the [validity] property exists and is "valid", the type of an element or attribute information item is represented by an expanded-QName whose namespace and local name correspond to the first applicable items in the following list: * If [member type definition] exists and its {name} property is present: - The {target namespace} and {name} properties of the [member type definition] property. * If the [type definition] property exists and its {name} property is present: - The {target namespace} and {name} properties of the [type definition] property. * If [member type definition anonymous] exists: - If it is false: the [member type definition namespace] and the [member type definition name]. 
  - Otherwise, the namespace and local name of the appropriate anonymous type name.
* If [type definition anonymous] exists:
  - If it is false: the [type definition namespace] and the [type definition name]
  - Otherwise, the namespace and local name of the appropriate anonymous type name."

It's related to a previous comment [3].

[3] http://www.w3.org/XML/Group/2003/08/xmlschema-datamodel-comments#d0e205

In the above comment, the schema WG suggested that the rules for [type definition] should be changed to handle anonymous types. On top of that, we believe that similar changes need to be applied to the rules for [member type definition].

Proposed fix: consider something similar to what the DOM3 Core spec adopted [4].

[4] http://www.w3.org/TR/2003/CR-DOM-Level-3-Core-20031107/core.html#TypeInfo

1.3 String values of elements and attributes

In sections 6.2.2 and 6.3.2, for "dm:string-value".

1.3.1 Lack of consistency

There are three ways to compute a string value:

- dm:string-value [5] of Element Node
- dm:string-value [6] of Attribute Node
- casting an atomic value to xs:string [7]

[5] http://www.w3.org/TR/2003/WD-xpath-datamodel-20031112/#ElementNodeAccessors
[6] http://www.w3.org/TR/2003/WD-xpath-datamodel-20031112/#AttributeNodeAccessors
[7] http://www.w3.org/TR/2003/WD-xpath-functions-20031112/#casting-to-string

First, three separate sets of casting/conversion rules are unnecessary. It appears that [5] and [6] are incomplete. [7] is close to accurate, but may have issues; [7] is outside our jurisdiction.

Our suggestion: there should be one set of rules and all three must point to it. Where should that conversion rule reside?

1.3.2 Lack of accuracy

When the type is not xs:QName or xs:NOTATION, why doesn't this accessor always return

- the concatenation of the string-values of all the text nodes among its <b>children</b> for elements, and
- the "string-value" property for attributes?

Are there cases where this approach doesn't work?
It seems [5] and [6] are trying to recover the original string in the instance document, but isn't it already available in the text nodes and string-value property?

Assuming the above approach doesn't work, which means we have to cast atomic values to string, there are further comments.

- For "simple type or complex type with simple content" cases, don't we need to consider derived types? That is, for example, instead of saying "If the element type is xs:anyURI", we say "If the element type is or is derived from xs:anyURI". Currently, only types derived from "xs:string" are considered.
- The phrases "the string", "the URI", and "the value" are used in [5] and [6], but it's not clear what they refer to. Some atomic value available somewhere?
- In the case for xs:anyURI: "... returns the characters of the URI". Editorial comment: shouldn't it be "... returns a string formed from the characters ..."?
- In the case of xs:QName:
  - Shouldn't xs:NOTATION be considered in the same way as xs:QName?
  - "If the value has a namespace URI, then there must be at least one prefix mapped to that URI in the in-scope namespaces. If there is no such prefix, an error is raised ("no prefix defined for namespace")." Is it still an error if the default namespace is mapped to that URI?
  - "If no error occurs, returns a string with the lexical form of a xs:QName using the prefix chosen as described above, and the local name of the value." But in the case where the value has no namespace URI, there is no prefix "chosen as described above".
- It seems that [6] doesn't consider types other than those listed. It needs a new bullet with something like "In all other cases, ..."

1.4 [validity] = invalid on an ancestor

In section 6.2.4, for "type":

"* If the [validity] property exists and is "valid" on this element and all of its ancestors, type is assigned as described in 3.3.1 Mapping PSVI Additions to Types
* Otherwise, xdt:untypedAny."
And in section 6.3.4, for "type":

"* If the [validity] property does not exist on this node or any of its ancestors, Infoset processing applies.
...
* If the [validity] property exists and is "valid", type is assigned as described in 3.3.1 Mapping PSVI Additions to Types
* Otherwise, xdt:untypedAtomic."

It seems that the rules for elements and attributes are different in the following case:

- [validity] exists on the current element/attribute and all its ancestors, and
- [validity] is "invalid" on one of the ancestors, and
- [validity] is "valid" on the current element/attribute.

In this case, for elements, the type is "xdt:untypedAny"; but for attributes, the type in the PSVI is used. Is there a reason for this difference?

1.5 Imported schema

In section 2.5:

"The data model uses expanded-QNames to represent the names of named types, which includes both the built-in types defined by [Schema Part 2] and named user-defined types declared in a schema and imported by a stylesheet or query."

"[Definition: An anonymous type name is an implementation defined, unique type name provided by the processor for every anonymous type declared in an imported schema.] "

- Does the word "import" here have the same meaning as used/defined in section 4.2.3 of the schema structure spec?
- If not, its meaning should be made clear. If yes, why do we only care about imported schemas, but not the "importing" schema?
- The first sentence indicates that types are imported; the second indicates that a schema is imported. Is there any difference?

Proposed fix: if there is no particular reason to include the word "imported" here, we suggest removing it. So the first sentence reads "... declared in a schema." And the second reads "... declared in a schema.]"

1.6 Validate vs. assess

In section 3.3:

"An instance of the data model can be constructed from a PSVI that has been strictly, laxly, or skip validated or validated using any combination assessment modes."
- The PSVI is an augmentation to the base infoset, so I don't believe it's VALIDATED. (Probably GENERATED.)
- Do the phrases "strictly/laxly/skip validated" have special meanings that are not defined in the schema spec? Are they different from "strictly/laxly/not assessed" from the schema spec?

Proposed fix: if there is no particular reason to introduce such a difference, and more importantly, to align with the schema spec, we suggest something along the following lines:

"An instance of the data model can be constructed from a PSVI, whose element and attribute information items have been strictly assessed, laxly assessed, or have not been assessed."

1.7 xsi attributes

In section 6.3.4:

"They will be validated appropriately by schema processors and will simply appear as attributes of type xs:anySimpleType if they haven't been schema validated."

- Editorial note: "They will be validated ..." and "if they haven't been schema validated" sound like they contradict each other.
- These attributes are just ordinary attributes in schema, and they have their declarations and types. Proper PSVI properties are available for them, so there is no need to treat them differently. And of course, "xs:anySimpleType" doesn't apply.

Proposed fix: remove the above sentence.

1.8 Union of list of union

In section 7:

"The value of a node whose type is a union type is represented by the appropriate value for the appropriate member type of the union type. If the member type is an atomic type, the value is represented as an atomic value of that type. If the member type is a list type, the value is represented as a sequence of atomic values whose type is the item type of the list type. The union type information is lost and only the specific type of the actual item is retained."

In the case where the member type is a list type, it's possible that the item type of that list is another union type.
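To make the nesting concrete, here is a minimal sketch in Python of how a value could be resolved when a union has a list member whose item type is itself a union. The classes `Atomic`, `ListType`, and `UnionType` and the function `typed_value` are hypothetical names introduced only for this illustration; they are not part of the DM draft or of any schema processor API, and the `int`/`str` parsers are crude stand-ins for real lexical validation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass
class Atomic:
    name: str
    parse: Callable[[str], object]  # must raise ValueError on an invalid lexical form


@dataclass
class ListType:
    item: "SimpleType"


@dataclass
class UnionType:
    members: List["SimpleType"]


SimpleType = Union[Atomic, ListType, UnionType]


def typed_value(t: SimpleType, lexical: str) -> List[Tuple[str, object]]:
    """Return a flat sequence of (atomic type name, value) pairs.

    The rules apply recursively, so a union may contain a list whose
    item type is again a union.
    """
    if isinstance(t, Atomic):
        return [(t.name, t.parse(lexical))]
    if isinstance(t, UnionType):
        # The first member type that accepts the lexical form wins;
        # the union type itself is not recorded in the result.
        for member in t.members:
            try:
                return typed_value(member, lexical)
            except ValueError:
                continue
        raise ValueError(f"no member type accepts {lexical!r}")
    if isinstance(t, ListType):
        result: List[Tuple[str, object]] = []
        for token in lexical.split():
            # Each item is resolved on its own, so items may end up
            # labeled with different atomic types.
            result.extend(typed_value(t.item, token))
        return result
    raise TypeError(f"unsupported type: {t!r}")


# Hypothetical types: a union whose only member is a list of (integer | string).
xs_integer = Atomic("xs:integer", int)
xs_string = Atomic("xs:string", str)
inner_union = UnionType([xs_integer, xs_string])
outer_union = UnionType([ListType(inner_union)])

sequence = typed_value(outer_union, "1 abc 2")
```

In this sketch the items of one list end up with different atomic types, which is the case that "a sequence of atomic values whose type is the item type of the list type" does not cover.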
Proposed fix: change the whole list in section 7 to something like the following:

"The value of a node is determined in the following way based on its type. The rules apply recursively.
* If the type's [variety] is atomic, then the value is represented as an atomic value of that type.
* If the type's [variety] is union, then the value is represented by the appropriate value for the appropriate member type of that type. Note that the member type's [variety] may be atomic or list. The union type information is lost and only the specific type of the actual item is retained.
* If the type's [variety] is list, then the value is represented by a sequence of appropriate values based on the item type of that type. Note that the item type's [variety] may be atomic or union."

1.9 Element-content whitespaces

In section 6.7.3, for "content":

"Applications may construct text nodes in the data model to represent insignificant white space."

In section 6.7.4, for "content":

"Construction from a PSVI is identical to construction from the Infoset."

Because construction from the PSVI is identical to construction from the Infoset, processors are still allowed to leave certain whitespace out of the text nodes.

- But there is always a possibility that the DTD and the XML Schema don't agree on whether certain whitespace is ignorable.
- It's even possible that if certain whitespace is ignored after DTD processing, the instance becomes schema-invalid. (Schema operates on all characters, not only those that are not ignorable whitespace.)

As an example of the second comment, consider

instance.xml

    <!DOCTYPE E [
      <!ELEMENT E (C?)>
    ]>
    <E>   </E>

schema.xsd

    <schema xmlns="http://www.w3.org/2001/XMLSchema">
      <element name="E" type="string" fixed="   "/>
    </schema>

In the above example, the internal DTD declares an element E with element-only content with an optional C element as its child.
And in the instance, the optional C doesn't appear in the content of E, and E only has 3 space characters as its children. And we have a schema, which declares element E to have the "xs:string" type with a fixed value of 3 spaces. Clearly, the instance is valid with respect to the schema. Now we construct a data model from the PSVI, if we ignore those element-content whitespaces (as allowed by 6.7.3), then we'll get an element node for E, without any children. Now the data model doesn't seem to be valid with respect to the schema anymore, because of the fixed value on E. Proposed fix: when a data model is constructed from PSVI, then all whitespaces have to be included. 1.10 Atomic values In section 7 "[Definition: An atomic value is a value in the value space of an atomic type labeled with that atomic type.]" What's the meaning of "labeled with"? It feels like DM atomic values are composed values, each of which has 2 properties: the schema atomic value and the schema type, and a DM atomic value is certainly different from a schema atomic value. "The value space of the atomic values is the union of the value spaces of the atomic types. This value space clearly includes those atomic values whose type is primitive, but it also includes those whose type is derived by restriction, as derivation by restriction always limits the value space." Because members of "the value spaces of the atomic types" are schema values, members of "the value space of the atomic values" are also schema values. This would imply that DM atomic values (composed values) are not in the value space of the atomic values (schema atomic values). This is counter-intuitive. Proposed fix: we think that the intention is for DM atomic values to be the same as schema atomic values, and it's always possible to find out the type that was used to validate/generate a certain DM atomic value. If this is true, then the "labelled with" phrase in the definition is a bit misleading. 
Maybe "labeled with that atomic type" should be removed from the definition, and a following note/description should make it clear that whenever an atomic value is stored in the DM, there is always a way to discover the corresponding type that was used to validate/generate the value.

1.11 Value space of xdt:untypedAtomic

Section 7.1.2 has a short description of xdt:untypedAtomic, but it doesn't answer the following questions:

- What is the value space of xdt:untypedAtomic?
- Does it overlap with the value spaces of schema primitive types?
- Does it contribute to "the value space of the atomic values" (from section 7)?

Without these being answered, it's difficult to understand the meaning of something like (from section 6.1.2, for "dm:typed-value")

"Returns dm:string-value of the node as an xdt:untypedAtomic value."

An immediate question is how to return an "xs:string" AS an "xdt:untypedAtomic" when they don't have a sub-type relation, and when xdt:untypedAtomic doesn't have a value space.

Proposed fix: we understand that this type is magic, and it shouldn't be considered an ordinary type (it is only used as an indicator). So to avoid the confusion, the spec needs to make it clear that it's magic, and its value space is not known, but this type can be used to "label" atomic values.

2. Other technical issues

2.1 Accessing unparsed entities

In section 6.1:

Document nodes have an [unparsed-entities] property, which is a sequence of entities. They also have accessors "dm:unparsed-entity-system-id" and "dm:unparsed-entity-public-id". So there are ways to query the system/public id of a given entity. Wouldn't it be useful to have an accessor that exposes all entities within a document node?
Propose to introduce: dm:unparsed-entities($node as document()) as xs:string* and optionally: dm:unparsed-entity-system-ids($node as document()) as xs:string* dm:unparsed-entity-public-ids($node as document()) as xs:string* (If the above 2 are introduced, then the processor must guarantee the order of the entities is the same for all 3 accessors.) 2.2 Text nodes in document node In section 6.1.3, for "children" property "For each element, processing instruction, comment, and maximal sequence of adjacent character information items found in the [children] property, a corresponding element, processing instruction, comment, or text node is constructed and that sequence of nodes is used as the value of the children property." But character information items are not supposed to appear in the [children] property of a document information item. Proposed fix: - change "comment, and maximal sequence of adjacent character information items" to "and comment", and - change "comment, or text node" to "and comment". 2.3 Ignored namespace information items In section 6.2.3, for "namespaces" "Implementations may ignore namespace information items for namespaces which do not appear in the expanded QName of any element or attribute information item." - It's not clear what "any (element or attribute ...)" means? Any element/attribute in the whole instance document? Any descendent elements/attributes? - It's also possible that the value of some element/attribute is of type QName/NOTATION, and we need to retain the namespace information for the namespace URI in such value. 
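The second point can be illustrated with a small sketch in Python. It models element and attribute information items with hypothetical classes (`Elem`, `Attr`) and a hypothetical helper `needed_namespaces`; none of these names come from the DM draft. The helper collects the namespace URIs that would have to be retained for an element, optionally counting URIs that occur only inside values of type xs:QName or xs:NOTATION.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

# (namespace URI or None, local name)
QName = Tuple[Optional[str], str]


@dataclass
class Attr:
    name: QName
    # Set only when the attribute's type is xs:QName or xs:NOTATION.
    qname_value: Optional[QName] = None


@dataclass
class Elem:
    name: QName
    attributes: List[Attr] = field(default_factory=list)
    children: List["Elem"] = field(default_factory=list)
    # Set only when the element's type is xs:QName or xs:NOTATION.
    qname_value: Optional[QName] = None


def needed_namespaces(root: Elem, include_values: bool = True) -> Set[str]:
    """Collect namespace URIs used by root and its descendants.

    With include_values=False only element/attribute names are inspected,
    which mirrors the rule as currently worded in the draft.
    """
    needed: Set[str] = set()

    def visit(item) -> None:
        candidates = [item.name]
        if include_values:
            candidates.append(item.qname_value)
        for qname in candidates:
            if qname is not None and qname[0] is not None:
                needed.add(qname[0])

    stack = [root]
    while stack:
        current = stack.pop()
        visit(current)
        for attribute in current.attributes:
            visit(attribute)
        stack.extend(current.children)
    return needed


# Hypothetical instance: "urn:a" occurs only inside a QName-typed attribute value.
doc = Elem(
    name=(None, "root"),
    attributes=[Attr(name=(None, "ref"), qname_value=("urn:a", "x"))],
    children=[Elem(name=("urn:b", "child"))],
)

names_only = needed_namespaces(doc, include_values=False)
names_and_values = needed_namespaces(doc)
```

Here "urn:a" is missing from `names_only`, so an implementation following the current wording could drop the only binding that makes the QName value of `ref` interpretable.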
Proposed fix: Implementations may ignore namespace information items for namespaces which

- do not appear in the expanded QName of the current element information item or any of its descendant element or attribute information items, and
- do not appear in the value of the current element information item or any of its descendant element or attribute information items, if the type of such value is xs:QName or xs:NOTATION.

2.4 Order of children in element nodes

In section 6.2.4, for "children":

"The order of these nodes is implementation defined."

So even if a PI appeared before another PI in the infoset, it's allowed for them to appear in the reverse order in the data model? This is counter-intuitive.

2.5 Missing constraints

[Section 6.3.1]

"2. If a attribute node A has a parent element E, then A must be among the attributes of E."

In the previous section, the rules go both ways. To be consistent, we need to insert a new rule before this one.

"2. If an attribute node A is among the <b>children</b> of an element E, then the <b>parent</b> of A <b>must</b> be E."

[Section 6.4.1]

"2. If a namespace node N has a parent element E, then N must be among the namespaces of E."

Same comments as above.

[Section 6.7.1]

The first rule about unique identity seems to be missing. Is this intended?

2.6 "--" in comments

In section 6.6.1:

"2. The string "--" must not occur within the content."

But it's possible to have "--" in the content of comments using entity/character references.

Proposed fix: remove this constraint.

2.7 Errors in the big example

In appendix D, I only took a quick look at the first few nodes, and spotted some problems.

- For PI node P1, "dm:typed-value" is missing.
- For element node E1, why is dm:type(E1) = xs:anyType? Shouldn't an anonymous type name be generated?
- For comment node C1, "dm:typed-value" should have a value.
- Also for comment node C1, "dm:type" is missing.

My guess is that there are more problems in the following nodes.
Was this generated by some program or written by hand?

3. Editorial notes

3.1 Values vs. sequences

There are places in the DM draft where the relation between values and sequences is not clear enough.

[Section 1]

"Every value in the data model is a sequence of zero or more items."
"An atomic value encapsulates an XML Schema atomic type and a corresponding value of that type."
"A sequence cannot be a member of a sequence."

The combination of the above sentences is at worst contradictory, and at best misleading. An atomic value is a value; a value is a sequence; a sequence can't contain another sequence. The result is that an atomic value can't be a member of a sequence, which clearly isn't the intention.

I don't have a better wording for the first sentence. But I think its intention is clear from other parts of the spec, and it wouldn't be a disaster to simply drop the first sentence.

[Section 2.2]

Related to a previous comment [8].

[8] http://www.w3.org/XML/Group/2003/08/xmlschema-datamodel-comments#d0e325

"Some accessors can accept or return sequences."

Aren't all values sequences? So "can" isn't accurate. It gives the impression that some other accessors "can" accept or return things other than sequences.

Suggested fix: drop the word "can".

[Section 2.2]

"The following notation is used to denote sequence values:
* V* denotes a sequence of zero or more items of type V.
* V? denotes a sequence of exactly zero or one items of type V.
* V+ denotes a sequence of one or more items of type V."

It seems to me that "only" the above 3 forms are used to denote sequence values. But from my understanding of the spec, "V" also denotes a sequence (of length 1).

Suggested fix: add the following as the first item: "* V denotes a sequence of one item of type V."

[Section 3]

"The data model also supports values that are not nodes. Examples of these are atomic values, sequences of atomic values, or sequences mixing nodes and atomic values."
Since values in the data model are all sequences, it's not appropriate to list "atomic values" as an example.

3.2 Accessors applicable to one node type

In section 5.11:

"It is defined on all seven node types, but always returns the empty sequence for all nodes except elements."

(Where "it" refers to the "nilled" accessor.)

- The same statement is true for another 2 accessors, "attributes" and "namespaces": they only have values for elements. Why aren't there similar notes for those 2 accessors?
- These 3 kinds of accessors only apply to elements. Then why have them here (defined on all seven node types), instead of just having them on element nodes?

3.3 Referring to accessors and property values

There are occurrences of phrases like "the parent of N", "among the children of N", "the string-value of N", etc., where it's not clear whether "parent/children/string-value" refers to a certain property of a node, or to the returned value of a certain accessor. Either way, it needs to be clarified. If the intention is to refer to property values, then they should be in bold font; if the intention is to refer to accessors, then such a convention should be made clear somewhere.
Received on Tuesday, 10 February 2004 18:44:36 UTC