Re: Simplifying XML Schema

I think this message from Bell Labs deserves more than a cursory reply.  The
current document is absurdly long and extremely complex.  While just a data
definition language, it is significantly longer than the Haskell standard
(which Phil Wadler chaired) and close to the size of, if not longer than,
the ML spec.  Both of these are full programming languages whose
specifications contain full formal semantics for the languages.  ML, in
particular, is fairly large and its data definition component is not
significantly less complex than xsdl ought to be.  Even worse, we've
probably breezed past 8879 - the SGML spec that XML has practically replaced
due to its _greater simplicity_, which we are about to remit to the dustbin
of history.

Generally speaking, unnecessary complexity is a sign of bad design.  The
leakage of schema information (xsi:type and xsi:null) into instances is an
ominous sign - it implies the design is less than crisp.  xsi:null worries
me because, while it was included to satisfy the needs of the database
community, by putting it in the base language, they've lost control over it
- there's nothing to enforce use that in any way accords with sql
semantics and there may be a lot of nasty surprises coming down the pike.
If the dbms vendors had instead used xsdl to create their own standard
mechanism they could have strictly controlled semantics.

The recommendation for simpleTypes is very interesting if it can eliminate
the content attribute.  It also has the significant advantage over the
current spec that mixed content is no longer required to be a string.  It
would be great if we could eliminate the equally redundant derivedBy
attribute and unify refinement.

The xsi:type issue is strongly connected with the 3rd section on
context-independent vs. context-dependent types.  While it is true that the
xsi:type issue was discussed at length (and - truth in advertising - I was
probably the only one arguing along the same lines as the authors) it was
done under a set of assumptions which, while not wrong - wrong is not the
right term for this - led to the current rococo result.  The issue of
context-independent vs. dependent is very important, but decisions have not
been made from that perspective.  Of course, having a unique URI for
everything referenceable from an instance (i.e., elements and types) ensures
context-independence and simplifies many things, but the group has not
chosen to support that.  While I don't think anything should stand in the
way of going to CR asap, I hope people will get a better understanding of
this and we will be able to modify things accordingly.

It's already pretty clear that the number 1 complaint we will hear is "why
is it so damn complicated".  The unstated part is "when it doesn't need to
be".  

Matthew

Message-Id: <200005122308.TAA15405584@nslocum.cs.bell-labs.com>
To: www-xml-schema-comments@w3.org
Cc: Mary Fernandez <mff@research.att.com>,
    simeon@research.bell-labs.com,
    wadler@research.bell-labs.com
Date: Fri, 12 May 2000 19:08:51 -0400
From: Philip Wadler <wadler@research.bell-labs.com>
Subject: Simplifying XML Schema


Simplifying XML Schema
----------------------

The current Schema proposal is complex.  Programmers have shown a
remarkable ability to put up with complexity, but we do not yet know
whether the XML community will be so forgiving.  We would like to
suggest that it is possible to greatly simplify XML Schema, while
not unduly limiting its power.  Indeed, some of the suggestions
below would both simplify Schema and extend its power at the same
time.

We are also asking the XML Query working group to support changes
along these lines, but in writing this letter we are not acting as
representatives of XML Query.

Yours sincerely,

Philip Wadler, Lucent
Jerome Simeon, Lucent
Mary Fernandez, AT&T

1. Clear separation between schema and data
-------------------------------------------

One of the nice features of XML is that documents are
"self-describing".  Schema has two features which run counter to this
philosophy, xsi:type and xsi:null.  Our motto here is, `Keep schema
out of the data!'

1.1. xsi:type
-------------

Schema permits refinement in two forms: an element may be declared
as being a subclass of another element, and a type may be declared
as a subtype of another type.  This is explained in Section 4 of the
primer.

(When one element is a subclass of another element, Schema says the
first element is `in the equivalence class' of the second.  We use
`subclass' because it has the right connotations, whereas `equivalence
class' does not.)

When subtyping is used without subclassing, the document is required
to include type information.  Here's an example from Section 4 of the
primer.

    <shipTo export-code="1" xsi:type="ipo:UK-Address">
        <name>Helen Zoe</name>
        <street>47 Eden Street</street>
        <city>Cambridge</city>
        <postcode>CB1 1JR</postcode>
    </shipTo>

    <billTo xsi:type="ipo:US-Address">
        <name>Robert Smith</name>
        <street>8 Oak Avenue</street>
        <city>Old Town</city>
        <state>PA</state>
        <zip>95819</zip>
    </billTo>

If subclassing is combined with subtyping, the use of xsi:type
can be avoided.

    <shipTo export-code="1">
        <UK-Address>
            <name>Helen Zoe</name>
            <street>47 Eden Street</street>
            <city>Cambridge</city>
            <postcode>CB1 1JR</postcode>
        </UK-Address>
    </shipTo>

    <billTo>
        <US-Address>
            <name>Robert Smith</name>
            <street>8 Oak Avenue</street>
            <city>Old Town</city>
            <state>PA</state>
            <zip>95819</zip>
        </US-Address>
    </billTo>

This latter form is easily read by anyone who understands XML,
even if they do not understand XML Schema.

We feel that the extra complexity of xsi:type outweighs any
of its advantages.  We suggest that subtyping be tied to subclassing,
and that xsi:type be removed.
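
To make the contrast concrete, here is a minimal Python sketch (not
taken from the primer) that recovers the kind of address from each of
the two forms above.  The namespace URIs and the helper names are
hypothetical placeholders, and the fragments are shortened.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, chosen only so that the fragments parse.
XSI_NS = "urn:example:xsi"
IPO_NS = "urn:example:ipo"

WITH_XSI_TYPE = f"""
<shipTo xmlns:xsi="{XSI_NS}" xmlns:ipo="{IPO_NS}"
        export-code="1" xsi:type="ipo:UK-Address">
    <name>Helen Zoe</name>
    <postcode>CB1 1JR</postcode>
</shipTo>
"""

WITH_SUBCLASS = """
<shipTo export-code="1">
    <UK-Address>
        <name>Helen Zoe</name>
        <postcode>CB1 1JR</postcode>
    </UK-Address>
</shipTo>
"""


def kind_from_xsi_type(xml_text):
    """Read the address kind from the schema-level xsi:type attribute."""
    root = ET.fromstring(xml_text)
    qname = root.get(f"{{{XSI_NS}}}type")  # value looks like "ipo:UK-Address"
    # ElementTree does not resolve the prefix inside an attribute value,
    # so this sketch simply strips it.
    return qname.split(":", 1)[-1]


def kind_from_element_name(xml_text):
    """Read the address kind from the name of the first child element."""
    root = ET.fromstring(xml_text)
    return root[0].tag
```

In the subclassed form the address kind is an ordinary element name.
In the xsi:type form the reader must know the attribute's namespace
and that its value is a qualified name whose prefix needs resolving.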

1.2.  xsi:null
--------------

Reading between the lines, it seems clear that xsi:null is included in
Schema to support some ways of using relational databases.  That is,
Schema is trying to help Query.  But it is not at all clear what the
Query group will decide about nulls.  We believe that xsi:null should
be removed from Schema.  Query should first decide on what mechanism is
required for nulls, and then discuss the situation with Schema if
Schema support is required.


2. Simple types vs. complex types
---------------------------------

One lack of orthogonality in XML Schema Part 1: Structures is that
simple types and complex types cannot always be used in the same
way. We suggest that simple types be permitted wherever complex types
are.

This would result in a number of simplifications:

* The 'content' attribute (which specifies 'mixed', 'element-only',
or 'empty') may be eliminated.

* Rather than `mixed', which allows pcdata to appear anywhere, one can
specify exactly where pcdata is allowed.

* One can specify that the presence of simple types is optional.

* This corresponds more directly to SGML and XML DTDs, which indicate
mixed content by explicitly mentioning PCDATA.

For example, we can now specify a LETTER element that consists
of a SALUTATION element, followed by some text, followed by a
CLOSING element.

      <xsd:element name='LETTER'>
         <xsd:element name='SALUTATION' type='xsd:string'/>
         <xsd:simpleType type="string"/>
         <xsd:element name='CLOSING' type='xsd:string'/>
      </xsd:element>

This is more precise than using `mixed', and, because it lists
the components in the order they appear, it is easier to read.

Of course, types must be parseable and serializable.  Usually, values
of primitive type can be space separated, the exception being strings
(which may themselves contain spaces).  Therefore, it is not allowed
to specify two successive occurrences of primitive type if one or both
of them is a string.
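
The restriction on successive primitive types can be checked
mechanically.  The following Python sketch assumes a content model
written as a flat list of ("element", name) and ("simple", type) pairs;
that representation and the function name are illustrative assumptions,
not part of the proposal.

```python
def violates_string_rule(content_model):
    """Return True if two successive simple types include a string.

    content_model is a list of pairs, either ("element", name) or
    ("simple", type_name), in document order.
    """
    for left, right in zip(content_model, content_model[1:]):
        both_simple = left[0] == "simple" and right[0] == "simple"
        if both_simple and "string" in (left[1], right[1]):
            return True
    return False


# The LETTER example: text sits between two elements, so it is allowed.
LETTER = [
    ("element", "SALUTATION"),
    ("simple", "string"),
    ("element", "CLOSING"),
]
```

Two successive integers pass the check because they can be separated
by a space, whereas an integer followed by a string is rejected because
the boundary between the two values cannot be recovered.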


3. Context-independent types vs context-dependent types
-------------------------------------------------------

One of the great structuring principles of DTDs is that elements
with the same name always have content of the same type.  Many users
of SGML take this as the foundation stone for structuring a document.

Schema departs from this: elements with the same name may
have contents of differing type, depending on the context where they
appear.  However, Schema goes only halfway toward this, as there are
some complex restrictions (apparently intended to ease parsing).

We suggest that the design should be a good horse or a good elephant,
not a hybrid beast.  Either choose a completely context-independent
design, similar to DTDs, or choose a completely context-dependent
approach, similar to that pursued by, for instance, the work on XDuce
at the University of Pennsylvania.  In mathematicians' terms, we should
either deal with trees that represent context-free grammars (which can
be parsed by top-down deterministic tree automata), or with regular
trees (which can be parsed by either bottom-up deterministic,
bottom-up non-deterministic, or top-down non-deterministic automata;
the three are equivalent).

3.1  Context-independent types
------------------------------

To make types context-independent, all that is needed is to
change Schema to only allow global element declarations.

Advantages:

* Context-independent structuring is simple, and has a long history
of use in the SGML community.  Many users of SGML are gobsmacked over
Schema's introduction of so much extra complexity to support features
that seem to them to be positively counterproductive.

* In Schema, subclasses can be defined only for global elements (as
explained in Section 1.3.3.2.3 of Schema Structures).  In the Software
Engineering community, there is a name for features that look
attractive (like local element declarations) but inhibit the use of
more powerful structuring techniques (like subclassing).  They are
called "bad".  The context-independent approach restores compatibility
with subclassing.

* Context-independent Schema are easy to parse, using top-down
deterministic tree automata.  Further, parsing can be incremental --
it is not necessary to read the entire tree into memory, and the space
required is proportional to the depth of the tree, not its size.

* All the current complexity of associating types with elements
in the infoset becomes unnecessary.  All one needs is a table mapping
element names to types.  The simpler system will make it easier for
other processors to exploit types.
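
As a rough illustration of the last two points, the Python sketch
below checks a document in a single top-down streaming pass, using
only a table from element names to types.  Each "type" is simplified
here to the set of permitted child names (real content models also
constrain order and repetition), and the table contents and function
name are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical table: element name -> names of permitted children.
ELEMENT_TYPES = {
    "billTo": {"US-Address"},
    "US-Address": {"name", "street", "city", "state", "zip"},
    "name": set(),
    "street": set(),
    "city": set(),
    "state": set(),
    "zip": set(),
}


def validate(xml_text, element_types):
    """Return (is_valid, deepest_stack) after one streaming pass."""
    stack = []      # names of the currently open elements
    deepest = 0     # largest stack size seen so far
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag not in element_types:
                return False, deepest
            # The name alone selects the type; only the parent is consulted.
            if stack and elem.tag not in element_types[stack[-1]]:
                return False, deepest
            stack.append(elem.tag)
            deepest = max(deepest, len(stack))
        else:
            stack.pop()
            elem.clear()  # drop content that has already been checked
    return True, deepest
```

The validator's own state is the stack of open element names, so it
grows with the depth of the document rather than with its size.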

3.2  Context-dependent types
----------------------------

To make types fully context-dependent, Schema should (at least) remove
the restriction that all sibling elements with the same name should
have the same type.

Advantages:

* The union of any two schema is a schema, which facilitates the
manipulation of multiple documents.  With context-independent
types, one must use namespaces.

* Context-dependent types are more expressive.  In particular, they
can be helpful for importing other data representations into XML.
For instance, one might have two relational tables, A and B, one where
ID is an integer, and one where ID is a string.  With context-
dependent types, these are easily described in a single schema,
since an ID element inside an A element may have a different type
than an ID element inside a B element.  With context-independent
types, one must either use renaming, or put the two tables in different
namespaces.

* The type system may give much more precise information for queries
or other applications.
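
The two-table example can be sketched in Python as follows.  Types are
looked up by (parent name, element name) rather than by element name
alone.  Keying on the immediate parent is a simplification of fully
context-dependent types, and the table, type names, and function names
are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: (parent name, element name) -> simple type name.
CONTEXT_TYPES = {
    ("A", "ID"): "integer",
    ("B", "ID"): "string",
}


def parses_as(type_name, text):
    """Return True if text is a valid value of the named simple type."""
    if type_name == "integer":
        try:
            int(text)
        except ValueError:
            return False
        return True
    return type_name == "string"


def type_errors(xml_text, context_types):
    """List (parent, element, text) for every value of the wrong type."""
    errors = []
    root = ET.fromstring(xml_text)
    for parent in root.iter():
        for child in parent:
            type_name = context_types.get((parent.tag, child.tag))
            if type_name is None:
                continue  # no type recorded for this element in this context
            text = child.text or ""
            if not parses_as(type_name, text):
                errors.append((parent.tag, child.tag, text))
    return errors
```

With a table keyed by element name alone, the two ID elements would
need a single shared type, which is why the context-independent
approach has to fall back on renaming or namespaces.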

Disadvantages:

* Parsing is more complex.  Parsing may be achieved by either
bottom-up deterministic, bottom-up non-deterministic, or top-down
non-deterministic tree automata; the three are equivalent.
Incremental parsing is not always possible, and in the worst case the
space required is proportional to the size of the whole tree.

Received on Thursday, 18 May 2000 21:59:17 UTC