XML Query Comments to XML Schema (3rd part)

Here is the third set of comments from the XML Query Working Group on
the XML Schema last call Working Draft.
    http://www.w3.org/TR/2000/WD-xmlschema-0-20000407/
    http://www.w3.org/TR/2000/WD-xmlschema-1-20000407/
    http://www.w3.org/TR/2000/WD-xmlschema-2-20000407/

In this part, we address the following issues:
   1.  Repetition
   2.  Data Integration
   3.  Suggestions for simplifying Schema Structures
   4.  Simple types vs. complex types

This list may not be exhaustive and the XML Query WG may provide
additional feedback at a later date.

- Philip Wadler, on behalf of the XML Query WG

1.  Repetition
--------------

The grammar of regular expressions in DTDs features three separate
operators, sequence (comma), choice (bar), and repeat (star).  In XML
Schema, the first two of these are denoted by `sequence' and `choice'
elements.  However, the third does not appear separately, and instead
`minOccurs' and `maxOccurs' may appear on every particle.  It would
better reflect the underlying structure of regular expressions to have
a separate `repeat' element, with `min' and `max' attributes.

For example, consider the DTD

	(a, b?, (c|d)+)

In the current XML Schema syntax, this is rendered as follows:

	<sequence>
	  <element name="a"/>
	  <element name="b" minOccurs="0" maxOccurs="1"/>
	  <choice minOccurs="1" maxOccurs="unbounded">
	    <element name="c"/>
	    <element name="d"/>
	  </choice>
	</sequence>

It would be better to use a syntax along the following lines:

	<sequence>
	  <element name="a"/>
	  <repeat min="0" max="1">
	    <element name="b"/>
	  </repeat>
	  <repeat min="1" max="unbounded">
	    <choice>
	      <element name="c"/>
	      <element name="d"/>
	    </choice>
	  </repeat>
	</sequence>
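
The translation from the current syntax to the proposed one is mechanical.
The following Python sketch is only an illustration, not part of the
proposal: the function name `to_repeat_form` is hypothetical, the attribute
names `min` and `max` follow the suggestion above, and it assumes that an
absent minOccurs or maxOccurs defaults to 1.

```python
import xml.etree.ElementTree as ET


def to_repeat_form(particle):
    """Return a copy of `particle` with occurrence bounds moved onto <repeat>.

    Assumption: a missing minOccurs or maxOccurs means "1".
    """
    attrs = dict(particle.attrib)
    lo = attrs.pop("minOccurs", "1")
    hi = attrs.pop("maxOccurs", "1")

    # Copy the particle itself without the occurrence attributes.
    copy = ET.Element(particle.tag, attrs)
    for child in particle:
        copy.append(to_repeat_form(child))

    # Exactly-once particles need no wrapper.
    if (lo, hi) == ("1", "1"):
        return copy

    wrapper = ET.Element("repeat", {"min": lo, "max": hi})
    wrapper.append(copy)
    return wrapper


current = ET.fromstring(
    '<sequence>'
    '<element name="a"/>'
    '<element name="b" minOccurs="0" maxOccurs="1"/>'
    '<choice minOccurs="1" maxOccurs="unbounded">'
    '<element name="c"/><element name="d"/>'
    '</choice>'
    '</sequence>'
)
proposed = to_repeat_form(current)
print(ET.tostring(proposed, encoding="unicode"))
```

Every particle is handled by the same few lines, which is the uniformity
the proposal is after.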

One could also define

	<star>...</star> to abbreviate
		<repeat min="0" max="unbounded">...</repeat>

	<plus>...</plus> to abbreviate
		<repeat min="1" max="unbounded">...</repeat>

	<option>...</option> to abbreviate
		<repeat min="0" max="1">...</repeat>

With these abbreviations, the above becomes

	<sequence>
	  <element name="a"/>
	  <option>
	    <element name="b"/>
	  </option>
	  <plus>
	    <choice>
	      <element name="c"/>
	      <element name="d"/>
	    </choice>
	  </plus>
	</sequence>
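
Since the abbreviations are pure shorthand, a processor could convert
between the two forms with a table lookup.  The Python sketch below is
one possible illustration; the names `ABBREVIATIONS` and `abbreviate`
are hypothetical, and it accepts either `min`/`max` or
`minOccurs`/`maxOccurs` on `repeat`, since the attribute names are not
settled.

```python
import xml.etree.ElementTree as ET

# (min, max) bounds that have a dedicated abbreviation.
ABBREVIATIONS = {
    ("0", "unbounded"): "star",
    ("1", "unbounded"): "plus",
    ("0", "1"): "option",
}


def abbreviate(node):
    """Return a copy of `node` with <repeat> replaced by its shorthand."""
    copy = ET.Element(node.tag, dict(node.attrib))
    for child in node:
        copy.append(abbreviate(child))

    if node.tag == "repeat":
        # Accept either spelling of the bounds attributes (an assumption).
        lo = node.get("min", node.get("minOccurs"))
        hi = node.get("max", node.get("maxOccurs"))
        short = ABBREVIATIONS.get((lo, hi))
        if short is not None:
            copy.tag = short
            copy.attrib.clear()
    return copy


long_form = ET.fromstring(
    '<sequence>'
    '<element name="a"/>'
    '<repeat min="0" max="1"><element name="b"/></repeat>'
    '<repeat min="1" max="unbounded">'
    '<choice><element name="c"/><element name="d"/></choice>'
    '</repeat>'
    '<repeat min="2" max="5"><element name="e"/></repeat>'
    '</sequence>'
)
short_form = abbreviate(long_form)
print(ET.tostring(short_form, encoding="unicode"))
```

A `repeat` whose bounds have no abbreviation, such as the last one
here, is left unchanged.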

There are two related but separate questions here.

(a)  How should repetition be represented in the XML syntax of Schema?
(b)  How should repetition be represented in the PSV infoset?

The examples above dealt with (a) for conciseness, but point (b)
is equally important, if not more so.  It is important that the PSV
infoset have a simple and uniform structure to aid its use in
query processing (and other processing).

This design is better for the following reasons.

* The structure of the XML corresponds closely to the structure of the
parse tree.  This makes it easier to read, easier to learn, and easier
to build processors for.

* The definitions of other elements are simplified.  One need not
worry about which elements might have minOccurs and maxOccurs
attached.

* Some possible points of confusion are reduced.  For instance,
some readers may be confused by a declaration like the following.

  <element name="c" type="xsd:integer" fixed="5" minOccurs="5" maxOccurs="5"/>

The proposed new form

  <repeat min="5" max="5">
    <element name="c" fixed="5" type="xsd:integer"/>
  </repeat>

specifies much more clearly that there are 5 elements with fixed value 5.
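
To illustrate the first point, a content model with a separate repeat
node can be held in a very small data structure.  The Python sketch
below is an assumption about how a processor might do this, not a
description of the PSV infoset; the class names `Elem`, `Seq`, `Choice`
and `Repeat` and the function `to_dtd` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Elem:
    name: str


@dataclass(frozen=True)
class Seq:
    items: Tuple["Particle", ...]


@dataclass(frozen=True)
class Choice:
    items: Tuple["Particle", ...]


@dataclass(frozen=True)
class Repeat:
    body: "Particle"
    min: int
    max: Optional[int]  # None stands for "unbounded"


Particle = Union[Elem, Seq, Choice, Repeat]

# The only ranges that DTD notation can express with an operator.
DTD_SUFFIX = {(0, 1): "?", (0, None): "*", (1, None): "+"}


def to_dtd(p: Particle) -> str:
    """Render a content model in DTD notation, one case per node kind."""
    if isinstance(p, Elem):
        return p.name
    if isinstance(p, Seq):
        return "(" + ", ".join(to_dtd(i) for i in p.items) + ")"
    if isinstance(p, Choice):
        return "(" + "|".join(to_dtd(i) for i in p.items) + ")"
    if isinstance(p, Repeat):
        suffix = DTD_SUFFIX.get((p.min, p.max))
        if suffix is None:
            raise ValueError("DTD notation has no operator for this range")
        return to_dtd(p.body) + suffix
    raise TypeError(f"not a particle: {p!r}")


model = Seq((
    Elem("a"),
    Repeat(Elem("b"), 0, 1),
    Repeat(Choice((Elem("c"), Elem("d"))), 1, None),
))
print(to_dtd(model))  # (a, b?, (c|d)+)
```

Because occurrence bounds live only on `Repeat`, the other node kinds
carry no occurrence information and each case of `to_dtd` stays short.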

Note that in XML Schema, by using the `group' element one can already
structure specifications in a way similar to repeat.

	<sequence>
	  <element name="a"/>
	  <group minOccurs="0" maxOccurs="1">
	    <element name="b"/>
	  </group>
	  <group minOccurs="1" maxOccurs="unbounded">
	    <choice>
	      <element name="c"/>
	      <element name="d"/>
	    </choice>
	  </group>
	</sequence>

Thus, `repeat' raises no issues that XML Schema does not already
address.

As a compromise position, some members of the Query working group
felt that it would be acceptable for Schema to support both the
current minOccurs and maxOccurs syntax and the new repeat syntax,
so long as the PSV infoset used the equivalent of the repeat
syntax.


2.  Data Integration
--------------------

There is a tension in Schema between expressiveness and ease of
parsing.  Schema does not allow sibling elements to have the same
name but different types, in order to ensure that a document
can be parsed in a top-down manner.  This restriction makes
some aspects of data integration difficult, as explained
in Section 1.3 of the following.

  XML Query Comments to XML Schema (1st part):
  <http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000May/0146.html>

There appears to be a simple way to significantly increase
expressiveness while not greatly increasing the complexity of parsing.
Namely, remove the above restriction on sibling elements, and replace
it with a different restriction: if sibling elements may have the same
name but different types, then these elements must be labelled with
xsi:type in the data.

We explain what this means by considering again the example
in Section 1.3 cited above.  This mentioned a data integration
query which yielded data in the following form.

    <authors>
       <author>Serge Abiteboul</author>
       <author>Peter Buneman</author>
       <author><first>Dan</first><last>Suciu</last></author>
    </authors>

One might wish to describe this data with a schema of the
following sort.

    <xsd:element name="authors">
       <xsd:complexType>
          <xsd:sequence>
             <xsd:element name="author" type="xsd:string"/>
             <xsd:element name="author" type="xsd:string"/>
             <xsd:element name="author" type="first-last"/>
          </xsd:sequence>
       </xsd:complexType>
    </xsd:element>
    <xsd:complexType name="first-last">
       <xsd:element name="first" type="xsd:string"/>
       <xsd:element name="last" type="xsd:string"/>
    </xsd:complexType>

The given data cannot be parsed top-down if serialized as above.
However, it could be parsed top-down if serialized with xsi:type
information, as required by the above proposal.

    <authors>
       <author xsi:type="xsd:string">Serge Abiteboul</author>
       <author xsi:type="xsd:string">Peter Buneman</author>
       <author xsi:type="first-last"><first>Dan</first><last>Suciu</last></author>
    </authors>

Most data would not require xsi:type (in particular, any data that is
permitted under the current proposal and does not require xsi:type
under the current proposal would also not require xsi:type under the
new proposal).  However, data with sibling elements with the same
name, which is not permitted under the current proposal, would be
permitted under the new proposal, so long as xsi:type information is
present.  This would greatly ease data integration.
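
The proposed restriction is simple to check.  The following Python
sketch is one possible illustration under stated assumptions: the
content model is given as a hypothetical list of (name, type) pairs,
the function names are invented for this example, and the xsi namespace
URI is the one from the final XML Schema Recommendation, which may
differ from the URI used by the draft under review.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: namespace URI of the final Recommendation.
XSI = "http://www.w3.org/2001/XMLSchema-instance"
XSI_TYPE = "{%s}type" % XSI


def ambiguous_names(declarations):
    """Names declared among siblings with more than one distinct type."""
    types = defaultdict(set)
    for name, type_name in declarations:
        types[name].add(type_name)
    return {name for name, seen in types.items() if len(seen) > 1}


def missing_xsi_type(parent, declarations):
    """Children that need xsi:type under the proposal but lack it."""
    needs_label = ambiguous_names(declarations)
    return [
        child for child in parent
        if child.tag in needs_label and child.get(XSI_TYPE) is None
    ]


# Sibling declarations from the example schema above.
decls = [
    ("author", "xsd:string"),
    ("author", "xsd:string"),
    ("author", "first-last"),
]

unlabelled = ET.fromstring(
    '<authors>'
    '<author>Serge Abiteboul</author>'
    '<author><first>Dan</first><last>Suciu</last></author>'
    '</authors>'
)
labelled = ET.fromstring(
    '<authors xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
    '<author xsi:type="xsd:string">Serge Abiteboul</author>'
    '<author xsi:type="first-last">'
    '<first>Dan</first><last>Suciu</last></author>'
    '</authors>'
)

print(len(missing_xsi_type(unlabelled, decls)))
print(len(missing_xsi_type(labelled, decls)))
```

Content models with no repeated names yield an empty set from
`ambiguous_names`, so existing data is unaffected.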

To further ease data integration, it would be helpful for xsi:type
to be able to refer to any type in a schema, including anonymous
types.  This might be achievable by use of xpath to select the
anonymous type.


3. Suggestions for simplifying Schema Structures
------------------------------------------------

The Query working group (QWG) think that the Schema Datatypes spec is well
written and well designed. QWG do not know whether the Schema Structures
spec is well designed; QWG do know that it is *definitely* badly
explained, and until it is explained better, it is difficult to tell
if it is badly designed.  In general, it is difficult to get a good
overview by reading the spec, and difficult to give good feedback,
because the spec is very hard to read.

After extensive review, QWG think the best approach might be to reorganize
the Structures specification so that there are two main sections, one devoted
to the Abstract Data Model, the other devoted to the declaration syntax and
the meaning of each kind of declaration. Concrete suggestions follow:

- For each Schema Component, there should be a complete enumeration
   of constraints and information set contributions. If these are discussed
   more fully in other sections of the document, the enumeration might take
   the form of a table that contains a brief explanation of the constraint and
   a hyperlink to the complete discussion.

   Currently, this information is scattered throughout the document,
   and the document never says how to find all of the constraints
   associated with a declaration; it only mentions several sections
   in which this information may "largely be found". Unless the document
   commits to placing all constraints in well-defined sections, it is
   not possible to locate these constraints without reading the entire
   document.

- Let the markup reflect the abstractions of the data model.
   There are several places where the concepts in the XML
   syntax are not closely related to the concepts in the property
   set - for instance, "optional" is a clear concept that could
   be used in both places, and need not be translated into "min occurs"
   and "max occurs". If the section on XML representation says that
   attributes may be "absent, optional, default, fixed, or
   required", there should be an explanation of what each of
   these means, or at least a link to such an explanation,
   and these concepts should be clearly reflected in the
   abstractions used elsewhere. Otherwise, the reader has to
   master two different sets of concepts, and know how the
   two are related.

- Define what a Complex Type is. There are at least two approaches
   to this that seem reasonable:

   (1) A complex type associates a content model and a set of
   attributes with a type name, making it possible to answer
   the question, "do these two elements have the same type?".
   In this case, a complex type can be defined by a grammar.
   This seems to be the spirit of the current Schema Working Draft,
   and corresponds to what is actually declared in a complex
   type declaration.

   (2) There is no distinction between complex type and datatype;
   each defines a lexical space, a value space, and a set of
   facets. This approach leads to a unified type system, but
   is probably significantly more complex than the first approach.

- In the current document, Section 3 provides an abstract property
   set for the syntax described in Section 4. The purpose of this is
   given in a note explaining that some processors will need to use
   alternative representations of the language defined in the XML
   representation. If approach (1) is taken, an abstract syntax
   could be used to describe an abstract language, of which the
   concrete XML is one possible representation. However, it is not
   clear that this has real advantages. We think that it is sufficient
   to note that alternative representations may occur in many systems,
   and to present (a) the concrete XML grammar, with a full explanation
   of what it defines, and (b) an abstract data model.

- Minimize the number of terms created, and define each term in the glossary.

- Provide a complete, typed Abstract Data Model.


4. Simple types vs. complex types
---------------------------------

One lack of orthogonality in XML Schema Part 1: Structures is that not
all kinds of types can be used in the same way.  Some members of
the Query working group felt that simple types should be permitted
wherever complex types are.

This would result in a number of simplifications:

* The 'content' attribute (which specifies 'mixed', 'element-only',
or 'empty') may be eliminated.

* Rather than `mixed', which allows pcdata to appear anywhere, one can
specify exactly where pcdata is allowed.  (Of course, types must be
parseable and serializable, so one would not allow two sequential
occurrences of simple data not separated by elements.)

* Options can now work on simple types.  (This resolves the
issues about "union types" related to issues 1.1.1 and 1.1.2 in the
current draft.)

We would welcome an explanation of why the Schema group has
chosen to support a lack of orthogonality in this area.

Received on Friday, 2 June 2000 15:40:05 UTC