Simplifying XML Schema
----------------------

The current Schema proposal is complex.  Programmers have shown a
remarkable ability to put up with complexity, but we do not yet know
whether the XML community will be so forgiving.  We would like to
suggest that it is possible to greatly simplify XML Schema, while
not unduly limiting its power.  Indeed, some of the suggestions
below would both simplify Schema and extend its power at the same
time.

We are also asking the XML Query working group to support changes
along these lines, but in writing this letter we are not acting as
representatives of XML Query.

Yours sincerely,

Philip Wadler, Lucent
Jerome Simeon, Lucent
Mary Fernandez, AT&T

1. Clear separation between schema and data
-------------------------------------------

One of the nice features of XML is that documents are
"self-describing".  Schema has two features that run counter to this
philosophy: xsi:type and xsi:null.  Our motto here is, `Keep schema
out of the data!'

1.1. xsi:type
-------------

Schema permits refinement in two forms: an element may be declared
as being a subclass of another element, and a type may be declared
as a subtype of another type.  This is explained in Section 4 of the
primer.

(When one element is a subclass of another element, Schema says the
first element is `in the equivalence class' of the second.  We use
`subclass' because it has the right connotations, whereas `equivalence
class' does not.)

When subtyping is used without subclassing, the document is required
to include type information.  Here's an example from Section 4 of the
primer.

    <shipTo export-code="1" xsi:type="ipo:UK-Address">
        <name>Helen Zoe</name>
        <street>47 Eden Street</street>
        <city>Cambridge</city>
        <postcode>CB1 1JR</postcode>
    </shipTo>

    <billTo xsi:type="ipo:US-Address">
        <name>Robert Smith</name>
        <street>8 Oak Avenue</street>
        <city>Old Town</city>
        <state>PA</state>
        <zip>95819</zip>
    </billTo>

If subclassing is combined with subtyping, the use of xsi:type
can be avoided.

    <shipTo export-code="1">
        <UK-Address>
            <name>Helen Zoe</name>
            <street>47 Eden Street</street>
            <city>Cambridge</city>
            <postcode>CB1 1JR</postcode>
        </UK-Address>
    </shipTo>

    <billTo>
        <US-Address>
            <name>Robert Smith</name>
            <street>8 Oak Avenue</street>
            <city>Old Town</city>
            <state>PA</state>
            <zip>95819</zip>
        </US-Address>
    </billTo>

This latter form is easily read by anyone who understands XML,
even if they do not understand XML Schema.
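
As a rough illustration of this point, the sketch below reads the
element-based form with nothing but a generic XML parser (Python's
standard xml.etree.ElementTree).  The kind of address is simply the
name of the child element, so no schema-aware processing is involved.
The wrapping <order> element and the helper name address_kind are
assumptions made for the example, not part of the primer.

```python
import xml.etree.ElementTree as ET

# The <order> wrapper is hypothetical; it only provides a single root element.
DOC = """
<order>
    <shipTo export-code="1">
        <UK-Address>
            <name>Helen Zoe</name>
            <street>47 Eden Street</street>
            <city>Cambridge</city>
            <postcode>CB1 1JR</postcode>
        </UK-Address>
    </shipTo>
    <billTo>
        <US-Address>
            <name>Robert Smith</name>
            <street>8 Oak Avenue</street>
            <city>Old Town</city>
            <state>PA</state>
            <zip>95819</zip>
        </US-Address>
    </billTo>
</order>
"""


def address_kind(party):
    """Return the tag of the single address element inside shipTo/billTo."""
    (address,) = list(party)  # exactly one child element is expected
    return address.tag


root = ET.fromstring(DOC)
# Map each party (shipTo, billTo) to the kind of address it carries.
kinds = {party.tag: address_kind(party) for party in root}
print(kinds)
```

With the xsi:type form, the same dispatch would have to look up a
namespaced attribute and resolve the `ipo:' prefix against the schema.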

We feel that the extra complexity of xsi:type outweighs any
of its advantages.  We suggest that subtyping be tied to subclassing,
and that xsi:type be removed.

1.2.  xsi:null
--------------

Reading between the lines, it seems clear that xsi:null is included in
Schema to support some ways of using relational databases.  That is,
Schema is trying to help Query.  But it is not at all clear what the
Query group will decide about nulls.  We believe that xsi:null should
be removed from Schema.  Query should first decide on what mechanism is
required for nulls, and then discuss the situation with Schema if
Schema support is required.


2. Simple types vs. complex types
---------------------------------

One lack of orthogonality in XML Schema Part 1: Structures is that
simple types and complex types cannot always be used in the same
way. We suggest that simple types be permitted wherever complex types
are.

This would result in a number of simplifications:

* The 'content' attribute (which specifies 'mixed', 'element-only',
or 'empty') may be eliminated.

* Rather than `mixed', which allows PCDATA to appear anywhere, one can
specify exactly where PCDATA is allowed.

* One can specify that the presence of simple types is optional.

* This corresponds more directly to SGML and XML DTDs, which indicate
mixed content by explicitly mentioning PCDATA.

For example, we can now specify a LETTER element that consists
of a SALUTATION element, followed by some text, followed by a
CLOSING element.

      <xsd:element name='LETTER'>
         <xsd:element name='SALUTATION' type='xsd:string'/>
         <xsd:simpleType type="string"/>
         <xsd:element name='CLOSING' type='xsd:string'/>
      </xsd:element>

This is more precise than using `mixed', and, because it lists
the components in the order they appear, it is easier to read.
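
To make the difference from `mixed' concrete, here is a minimal
sketch of a hand-written check for the LETTER model, using Python's
standard xml.etree.ElementTree.  The function name check_letter and
the decision to ignore surrounding whitespace are assumptions for
the example; the proposal itself does not say how whitespace would be
treated.

```python
import xml.etree.ElementTree as ET


def check_letter(xml_text):
    """Return the body text if the document matches the LETTER model, else None."""
    letter = ET.fromstring(xml_text)
    children = list(letter)
    if letter.tag != "LETTER":
        return None
    if [child.tag for child in children] != ["SALUTATION", "CLOSING"]:
        return None
    salutation, closing = children
    # Text is allowed only between SALUTATION and CLOSING (assumed: whitespace
    # elsewhere is ignored).  ElementTree stores that text as salutation.tail.
    if (letter.text or "").strip() or (closing.tail or "").strip():
        return None
    # SALUTATION and CLOSING are strings, so they may not contain elements.
    if len(salutation) or len(closing):
        return None
    return (salutation.tail or "").strip()


good = ("<LETTER><SALUTATION>Dear Sir,</SALUTATION> Thank you. "
        "<CLOSING>Yours</CLOSING></LETTER>")
# Text before SALUTATION: fine under `mixed', rejected by the precise model.
bad = ("<LETTER>Hello <SALUTATION>Dear Sir,</SALUTATION>"
       "<CLOSING>Yours</CLOSING></LETTER>")

print(check_letter(good))
print(check_letter(bad))
```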

Of course, types must be parseable and serializable.  Usually, values
of primitive type can be separated by spaces, the exception being
strings (which may themselves contain spaces).  Therefore, a content
model should not be allowed to specify two successive occurrences of
primitive type if one or both of them is a string.
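
The restriction can be checked mechanically.  The following sketch
assumes a made-up representation of a content model as a list of
("prim", type name) and ("elem", element name) items; neither that
representation nor the function name ambiguous_pairs comes from the
Schema drafts.

```python
def ambiguous_pairs(model):
    """Return index pairs of adjacent primitive items where one is a string.

    `model` is a list of items, each ("prim", type_name) or
    ("elem", element_name), in the order they appear in the content model.
    """
    pairs = []
    for i in range(len(model) - 1):
        (kind_a, name_a), (kind_b, name_b) = model[i], model[i + 1]
        # Two adjacent primitives are only a problem if a string is involved,
        # because a string may swallow the separating space.
        if kind_a == kind_b == "prim" and "string" in (name_a, name_b):
            pairs.append((i, i + 1))
    return pairs


# Two integers: "12 34" splits unambiguously on the space.
two_integers = [("prim", "integer"), ("prim", "integer")]
# Integer then string: the boundary inside "12 34 Main St" cannot be recovered
# in general, so this model would be rejected.
integer_then_string = [("prim", "integer"), ("prim", "string")]
# The LETTER model: the string is fenced in by elements on both sides.
letter = [("elem", "SALUTATION"), ("prim", "string"), ("elem", "CLOSING")]

print(ambiguous_pairs(two_integers))
print(ambiguous_pairs(integer_then_string))
print(ambiguous_pairs(letter))
```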


3. Context-independent types vs context-dependent types
-------------------------------------------------------

One of the great structuring principles of DTDs is that elements
with the same name always have content of the same type.  Many users
of SGML take this as the foundation stone for structuring a document.

Schema departs from this: elements with the same name may
have contents of differing type, depending on the context where they
appear.  However, Schema goes only halfway toward this, as there are
some complex restrictions (apparently intended to ease parsing).

We suggest that the design should be a good horse or a good elephant,
not a hybrid beast.  Either choose a completely context-independent
design, similar to DTDs, or choose a completely context-dependent
approach, similar to that pursued by, for instance, the work on XDuce
at the University of Pennsylvania.  In mathematicians' terms, we should
either deal with trees that represent context-free grammars (which can
be parsed by top-down deterministic tree automata), or with regular
trees (which can be parsed by bottom-up deterministic, bottom-up
non-deterministic, or top-down non-deterministic tree automata;
the three are equivalent).

3.1  Context-independent types
------------------------------

To make types context-independent, all that is needed is to
change Schema to only allow global element declarations.

Advantages:

* Context-independent structuring is simple, and has a long history
of use in the SGML community.  Many users of SGML are gobsmacked over
Schema's introduction of so much extra complexity to support features
that seem to them to be positively counterproductive.

* In Schema, subclasses can be defined only for global elements (as
explained in Section 1.3.3.2.3 of Schema Structures).  In the Software
Engineering community, there is a name for features that look
attractive (like local element declarations) but inhibit the use of
more powerful structuring techniques (like subclassing).  They are
called "bad".  The context-independent approach restores compatibility
with subclassing.

* Context-independent schemas are easy to parse, using top-down
deterministic tree automata.  Further, parsing can be incremental --
it is not necessary to read the entire tree into memory, and the space
required is proportional to the depth of the tree, not its size.

* All the current complexity of associating types with elements
in the infoset becomes unnecessary.  All one needs is a table mapping
element names to types.  The simpler system will make it easier for
other processors to exploit types.
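
The last two points can be sketched together.  The code below is a
minimal, assumed design rather than anything taken from the Schema
drafts: a single table maps each element name to a small deterministic
automaton over child element names, and a streaming validator built on
Python's standard xml.sax keeps one automaton state per open element.
The table contents, the LEAF shorthand and the names Validator and
validate are all hypothetical.

```python
import xml.sax

# Hypothetical table: element name -> (transitions, accepting states).
# Each content model is a small DFA over child element names; state 0 is
# the start state.  LEAF means "no child elements allowed".
LEAF = ({}, {0})
TABLE = {
    "billTo": ({0: {"US-Address": 1}}, {1}),
    "US-Address": ({0: {"name": 1}, 1: {"street": 2}, 2: {"city": 3},
                    3: {"state": 4}, 4: {"zip": 5}}, {5}),
    "name": LEAF, "street": LEAF, "city": LEAF, "state": LEAF, "zip": LEAF,
}


class Validator(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.stack = []      # one [element name, DFA state] per open element
        self.max_stack = 0   # deepest the stack ever got
        self.errors = []

    def startElement(self, name, attrs):
        if name not in TABLE:
            self.errors.append(f"unknown element <{name}>")
        if self.stack:
            # Advance the parent's automaton on this child's name.
            parent, state = self.stack[-1]
            step = TABLE.get(parent, LEAF)[0].get(state, {})
            if name in step:
                self.stack[-1][1] = step[name]
            else:
                self.errors.append(f"<{name}> not allowed here in <{parent}>")
        # The child's type depends only on its name: always start in state 0.
        self.stack.append([name, 0])
        self.max_stack = max(self.max_stack, len(self.stack))

    def endElement(self, name):
        _, state = self.stack.pop()
        if state not in TABLE.get(name, LEAF)[1]:
            self.errors.append(f"<{name}> ended with required content missing")


def validate(xml_bytes):
    """Stream the document once; return (errors, maximum stack depth)."""
    handler = Validator()
    xml.sax.parseString(xml_bytes, handler)
    return handler.errors, handler.max_stack


GOOD = (b"<billTo><US-Address><name>Robert Smith</name>"
        b"<street>8 Oak Avenue</street><city>Old Town</city>"
        b"<state>PA</state><zip>95819</zip></US-Address></billTo>")
# Same document with the <state> element left out.
BAD = (b"<billTo><US-Address><name>Robert Smith</name>"
       b"<street>8 Oak Avenue</street><city>Old Town</city>"
       b"<zip>95819</zip></US-Address></billTo>")

print(validate(GOOD))
print(validate(BAD))
```

The validator never builds a tree: its only working storage is the
stack, whose height equals the current nesting depth.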

3.2  Context-dependent types
----------------------------

To make types fully context-dependent, Schema should (at least) remove
the restriction that all sibling elements with the same name should
have the same type.

Advantages:

* The union of any two schemas is a schema, which facilitates the
manipulation of multiple documents.  With context-independent
types, one must use namespaces.

* Context-dependent types are more expressive.  In particular, they
can be helpful for importing other data representations into XML.
For instance, one might have two relational tables, A and B, one where
ID is an integer, and one where ID is a string.  With context-
dependent types, these are easily described in a single schema,
since an ID element inside an A element may have a different type
than an ID element inside a B element.  With context-independent
types, one must either use renaming, or put the two tables in different
namespaces.

* The type system may give much more precise information for queries
or other applications.
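
The relational example above can be sketched as follows.  Here the
type table is keyed by the parent element name as well as the element
name, which is enough for the A/B case but is only a special case of
fully context-dependent types.  The <db> wrapper, the table layout and
the function name typed_ids are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Context-dependent table: the key includes the parent element name.
# The element names A, B and ID come from the example in the text.
TYPES_BY_CONTEXT = {("A", "ID"): int, ("B", "ID"): str}


def typed_ids(xml_text):
    """Convert each ID using the type selected by its parent element."""
    root = ET.fromstring(xml_text)
    result = []
    for table in root:
        for child in table:
            convert = TYPES_BY_CONTEXT[(table.tag, child.tag)]
            result.append((table.tag, convert(child.text)))
    return result


# The <db> wrapper is hypothetical; it only provides a single root element.
DOC = "<db><A><ID>42</ID></A><B><ID>x-17</ID></B></db>"
print(typed_ids(DOC))
```

A table keyed by element name alone has room for only one entry for
ID, so one of the two tables would have to be renamed or moved to
another namespace.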

Disadvantages:

* Parsing is more complex.  Parsing may be achieved by bottom-up
deterministic, bottom-up non-deterministic, or top-down
non-deterministic tree automata; the three are equivalent.
Incremental parsing is not always possible, and in the worst case
the space required is proportional to the size of the whole tree.

Received on Friday, 12 May 2000 19:10:03 UTC