- From: Philip Wadler <wadler@research.bell-labs.com>
- Date: Fri, 02 Jun 2000 15:39:32 -0400
- To: www-xml-schema-comments@w3.org
- cc: w3c-xml-query-wg@w3.org
XML Query Comments to XML Schema (3rd part)

Here is the third set of comments from the XML Query Working Group on
the XML Schema last call Working Draft.

  http://www.w3.org/TR/2000/WD-xmlschema-0-20000407/
  http://www.w3.org/TR/2000/WD-xmlschema-1-20000407/
  http://www.w3.org/TR/2000/WD-xmlschema-2-20000407/

In this version, we address the following issues:

  1. Repetition
  2. Data Integration
  3. Schema Structures: Suggestions for Simplifying
  4. Simple types vs. Complex Types

This list may not be exhaustive and the XML Query WG may provide
additional feedback at a later date.

- Philip Wadler, on behalf of the XML Query WG


1. Repetition
--------------

The grammar of regular expressions in DTDs features three separate
operators, sequence (comma), choice (bar), and repeat (star). In XML
Schema, the first two of these are denoted by `sequence' and `choice'
elements. However, the third does not appear separately, and instead
`minOccurs' and `maxOccurs' may appear on every particle. It would
better reflect the underlying structure of regular expressions to have
a separate `repeat' element, with `min' and `max' attributes.
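As a rough illustration of the structural point (a sketch only, not part
of the Schema drafts or of the Query WG's text), the three operators can
be modelled as three separate node kinds, with occurrence bounds living
only on the repeat node. The names `Element`, `Sequence`, `Choice`,
`Repeat` and `to_dtd` are hypothetical, chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Element:
    name: str


@dataclass(frozen=True)
class Sequence:
    items: Tuple["Particle", ...]


@dataclass(frozen=True)
class Choice:
    items: Tuple["Particle", ...]


@dataclass(frozen=True)
class Repeat:
    # Occurrence bounds appear here and nowhere else in the model.
    body: "Particle"
    min: int = 0
    max: Optional[int] = None  # None stands for "unbounded"


Particle = Union[Element, Sequence, Choice, Repeat]


def to_dtd(p: Particle) -> str:
    """Render a particle in DTD content-model syntax, where DTDs can express it."""
    if isinstance(p, Element):
        return p.name
    if isinstance(p, Sequence):
        return "(" + ", ".join(to_dtd(i) for i in p.items) + ")"
    if isinstance(p, Choice):
        return "(" + " | ".join(to_dtd(i) for i in p.items) + ")"
    if isinstance(p, Repeat):
        # DTDs only have operators for these three combinations of bounds.
        suffix = {(0, None): "*", (1, None): "+", (0, 1): "?"}.get((p.min, p.max))
        if suffix is None:
            raise ValueError("DTDs have no operator for these bounds")
        return to_dtd(p.body) + suffix
    raise TypeError(f"unknown particle: {p!r}")


model = Sequence((Element("x"), Repeat(Choice((Element("y"), Element("z"))), min=0)))
print(to_dtd(model))  # prints "(x, (y | z)*)"
```

In this sketch, `Element`, `Sequence` and `Choice` carry no occurrence
information at all, which is the simplification argued for above.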

For example, consider the DTD

  (a, b?, (c|d)+)

In the current XML Schema syntax, this is rendered as follows:

  <sequence>
    <element ref="a"/>
    <element ref="b" minOccurs="0" maxOccurs="1"/>
    <choice minOccurs="1" maxOccurs="unbounded">
      <element name="c"/>
      <element name="d"/>
    </choice>
  </sequence>

It would be better to use a syntax along the following lines:

  <sequence>
    <element name="a"/>
    <repeat minOccurs="0" maxOccurs="1">
      <element name="b"/>
    </repeat>
    <repeat minOccurs="1" maxOccurs="unbounded">
      <choice>
        <element name="c"/>
        <element name="d"/>
      </choice>
    </repeat>
  </sequence>

One could also define

  <star>...</star>
    to abbreviate
  <repeat minOccurs="0" maxOccurs="unbounded">...</repeat>

  <plus>...</plus>
    to abbreviate
  <repeat minOccurs="1" maxOccurs="unbounded">...</repeat>

  <option>...</option>
    to abbreviate
  <repeat minOccurs="0" maxOccurs="1">...</repeat>

With these abbreviations, the above becomes

  <sequence>
    <element name="a"/>
    <option>
      <element name="b"/>
    </option>
    <plus>
      <choice>
        <element name="c"/>
        <element name="d"/>
      </choice>
    </plus>
  </sequence>

There are two related but separate questions here.

  (a) How should repetition be represented in the XML syntax of Schema?
  (b) How should repetition be represented in the PSV infoset?

The examples above dealt with (a) for conciseness, but point (b) is
equally important, if not more so. It is important that the PSV infoset
have a simple and uniform structure to aid its use in query processing
(and other processing).

This design is better for the following reasons.

* The structure of the XML corresponds closely to the structure of the
  parse tree. This makes it easier to read, easier to learn, and easier
  to build processors.

* The definitions of other elements are simplified. One need not worry
  about which elements might have minOccurs and maxOccurs attached.

* Some possible points of confusion are reduced. For instance, some
  readers may be confused by a declaration like the following.

    <element name="c" type="xsd:integer" fixed="5"
             minOccurs="5" maxOccurs="5"/>

  The proposed new form

    <repeat min="5" max="5">
      <element name="c" fixed="5" type="xsd:integer"/>
    </repeat>

  specifies much more clearly that there are 5 elements with fixed
  value 5.

Note that in XML Schema, by using the `group' element one can already
structure specifications in a way similar to repeat.

  <sequence>
    <element name="a"/>
    <group minOccurs="0" maxOccurs="1">
      <element name="b"/>
    </group>
    <group minOccurs="1" maxOccurs="unbounded">
      <choice>
        <element name="c"/>
        <element name="d"/>
      </choice>
    </group>
  </sequence>

Thus, `repeat' introduces no new issues not already dealt with by XML
Schema.

As a compromise position, some members of the Query working group felt
that it would be acceptable for Schema to support both the current
minOccurs and maxOccurs syntax and the new repeat syntax, so long as
the PSV infoset used the equivalent of the repeat syntax.


2. Data Integration
--------------------

There is a tension in Schema between expressiveness and ease of
parsing. Schema disallows sibling elements to have the same name but
different types, in order to ensure that a document can be parsed in a
top down manner. This restriction makes difficult some aspects of data
integration, as explained in Section 1.3 of the following.

  XML Query Comments to XML Schema (1st part):
  <http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000May/0146.html>

There appears to be a simple way to significantly increase
expressiveness while not greatly increasing the complexity of parsing.
Namely, remove the above restriction on sibling elements, and replace
it with a different restriction: if sibling elements may have the same
name but different types, then these elements must be labelled with
xsi:type in the data.

We explain what this means by considering again the example in Section
1.3 cited above. This mentioned a data integration query which yielded
data in the following form.

  <authors>
    <author>Serge Abiteboul</author>
    <author>Peter Buneman</author>
    <author><first>Dan</first><last>Suciu</last></author>
  </authors>

One might wish to describe this data with a schema of the following
sort.

  <xsd:element name="result">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="author" type="xsd:string"/>
        <xsd:element name="author" type="xsd:string"/>
        <xsd:element name="author" type="first-last"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:complexType name="first-last">
    <xsd:element name="first" type="xsd:string"/>
    <xsd:element name="last" type="xsd:string"/>
  </xsd:complexType>

The given data cannot be parsed top-down if serialized as above.
However, it could be parsed top-down if serialized with xsi:type
information, as required by the above proposal.

  <authors>
    <author xsi:type="xsd:string">Serge Abiteboul</author>
    <author xsi:type="xsd:string">Peter Buneman</author>
    <author xsi:type="first-last"><first>Dan</first><last>Suciu</last></author>
  </authors>

Most data would not require xsi:type (in particular, any data that is
permitted under the current proposal and does not require xsi:type
under the current proposal would also not require xsi:type under the
new proposal). However, data with sibling elements with the same name,
which is not permitted under the current proposal, would be permitted
under the new proposal, so long as xsi:type information is present.
This would greatly ease data integration.

To further ease data integration, it would be helpful for xsi:type to
be able to refer to any type in a schema, including anonymous types.
This might be achievable by use of xpath to select the anonymous type.


3. Suggestions for simplifying Schema Structures
------------------------------------------------

The Query working group (QWG) think that the Schema Datatypes spec is
well written and well designed.
QWG do not know whether the Schema Structures spec is well designed;
QWG do know that it is *definitely* badly explained, and until it is
explained better, it is difficult to tell if it is badly designed. In
general, it is difficult to get a good overview by reading the spec,
and difficult to give good feedback, because the spec is very hard to
read.

After extensive review, QWG think the best approach might be to
reorganize the Structures specification so that there are two main
sections, one devoted to the Abstract Data Model, the other devoted to
the declaration syntax and the meaning of each kind of declaration.

Concrete suggestions follow:

- For each Schema Component, there should be a complete enumeration of
  constraints and information set contributions. If these are discussed
  more fully in other sections of the document, the enumeration might
  take the form of a table that contains a brief explanation of the
  constraint and a hyperlink to the complete discussion. Currently, not
  only is this information scattered throughout the document, the
  document never tells how to find all of the constraints associated
  with a declaration, though it mentions several sections in which this
  information may "largely be found". If the document is unwilling to
  commit itself to placing all constraints in well-defined portions of
  the document, it is not possible to locate these constraints without
  reading the entire document.

- Let the markup reflect the abstractions of the data model. There are
  several places where the concepts in the XML syntax are not closely
  related to the concepts in the property set - for instance,
  "optional" is a clear concept that could be used in both places, and
  need not be translated into "min occurs" and "max occurs".
  If the section on XML representation says that attributes may be
  "absent, optional, default, fixed, or required", there should be an
  explanation of what each of these means, or at least a link to such
  an explanation, and these concepts should be clearly reflected in the
  abstractions used elsewhere. Otherwise, the reader has to master two
  different sets of concepts, and know how the two are related.

- Define what a Complex Type is. There are at least two approaches to
  this that seem reasonable:

  (1) A complex type associates a content model and a set of attributes
      with a type name, making it possible to answer the question, "do
      these two elements have the same type?". In this case, a complex
      type can be defined by a grammar. This seems to be the spirit of
      the current Schema Working Draft, and corresponds to what is
      actually declared in a complex type declaration.

  (2) There is no distinction between complex type and datatype; each
      defines a lexical space, a value space, and a set of facets. This
      approach leads to a unified type system, but is probably
      significantly more complex than the first approach.

- In the current document, Section 3 provides an abstract property set
  for the syntax described in Section 4. The purpose for this is given
  in a note that explains that some processors will need to use
  alternative representations of the language defined in the XML
  representation. If approach (1) is taken, an abstract syntax could be
  used to describe an abstract language, of which the concrete XML is
  one possible representation. However, it is not clear that this has
  real advantages. I think that it is sufficient to note that
  alternative representations may occur in many systems, and to present
  (a) the concrete XML grammar, with a full explanation of what it
  defines, and (b) an abstract data model.

- Minimize the number of terms created, and define each term in the
  glossary.

- Do a complete typed Abstract Data Model.


4. Simple types vs. complex types
---------------------------------

One lack of orthogonality in XML Schema Part 1: Structures is that not
all kinds of types can be used in the same way. Some members of the
Query working group felt that simple types should be permitted wherever
complex types are. This would result in a number of simplifications:

* The 'content' attribute (which specifies 'mixed', 'element-only', or
  'empty') may be eliminated.

* Rather than `mixed', which allows pcdata to appear anywhere, one can
  specify exactly where pcdata is allowed. (Of course, types must be
  parseable and serializable, so one would not allow two sequential
  occurrences of simple data not separated by elements.)

* Options can now work on simple types. (This resolves the issues about
  "union types" related to issues 1.1.1 and 1.1.2 in the current
  draft.)

We would welcome an explanation of why the Schema group has chosen to
support a lack of orthogonality in this area.

Received on Friday, 2 June 2000 15:40:05 UTC