Comment on XSD 1.1 from Rick Jelliffe on 2009-05-13 (www-tag@w3.org from May 2009)

From: Rick Jelliffe <rjelliffe@allette.com.au>
Date: Thu, 14 May 2009 01:25:38 +1000
To: www-xml-schema-comments@w3.org, www-tag@w3.org
Message-ID: <4A0AE672.7050202@allette.com.au>
I would like to register with the W3C TAG and the W3C XML Schema WG 
that, on having considered the XSD 1.1 draft, I think it is exactly the 
wrong direction for the WG and W3C to be taking.  That is, while each 
individual decision may be well-founded, and each change justifiable and 
beneficial, the total effect will not help get us out of the mess that 
XML Schemas has created, but mire us further in it.

I see this as highly analogous to the situation with the SGML 5-year 
review at ISO in the early 1990s. Many small solutions to individual 
problems had been made, and many whizz-bang new ideas added, and there 
were many worthy new things on the cards.

But the fundamental problem was SGML was too big. The approach was of 
course to slim it down to XML, and to reintroduce many of the cast-off 
features and ideas (DTDs, modules) into layers on top of XML (schemas, 
namespaces.)

(A further parallel may indeed be that a change in forum was necessary 
in order to get this change: in a certain sense the original developers 
of SGML were "part of the problem" not "part of the solution."  Not 
because of malice or ineptitude; quite the reverse. The dynamics, 
personalities and goals of the working group were only capable of change 
in the direction of neatness and expansion. Indeed, I know that many on 
the W3C Schema WG are acutely aware of these issues, but perhaps the 
stars have never been aligned to address this. Since the W3C TAG itself 
has such a rich representation from the XML Schema WG, I hope that they 
may be conduits for fresh thinking from the TAG and not conduits for 
rationalizations from the Schema WG.)

Comments on the problem
---------------------------

That XML Schemas is in a crisis and has failed to meet some of its basic 
goals can be seen by the work on XML Schema Patterns for Databinding. 
That two such comprehensive lists of patterns (Basic and Advanced) were 
necessary is a sign of bad layering.

Indeed, when one considers the original requirements document for XML 
Schemas, http://www.w3.org/TR/NOTE-xml-schema-req, its shortcomings 
become more manifest. For example, in the Usage Scenarios, XML Schemas 
has not been successful for:
 
 4) Traditional document authoring/editing governed by schema constraints.
    (DTDs and RELAX NG have large inroads in this area: for example,
    OASIS ODF. I note that even for the XML Schemas for ISO OOXML
    [DIS29500], which had been written to use a very conservative subset
    of XML Schemas, it turned out that Xerces would not accept schemas
    allowed by Microsoft's validators, both of which are well-regarded
    and mature implementations.)

 5) Use schema to help query formulation and optimization.
    (The current draft has to change its type model to fit XQuery.)

 6) Open and uniform transfer of data between applications, including
    databases.
    (See the databinding comments above.)

Furthermore, even in the online application scenarios 1, 2, 3, and 7, the 
heavyweight processing that XML Schemas requires and the complexity of 
its concepts have meant that it is rarely actually used for validation, 
even as it is so inadequate for databinding.

So if it is not congenial for validation, and it is not a success for 
reliable databinding, is it at least good for documentation?  In fact, 
the verbosity of XML Schemas makes it utterly unusable for presenting to 
humans to understand a document's structure. In this regard, I note that 
the recent HTML 5 drafts have reverted to something akin to RELAX NG 
Compact Syntax (which looks like DTD content models and has a standard 
mapping to the XML form.) 

Further, if XML Schemas is not usable for documentation, is it useful 
for generating useful validation messages for humans? The answer is 
clearly that the messages produced by implementations of XML Schemas are 
not much use, particularly the obscure structural messages. As someone who 
has both implemented most of XML Schemas (a converter to Schematron) 
and who has customized the messages from various schema processors, I 
don't see how some of the messages can be made human-friendly, since 
they relate to obscure rules in XML Schemas.

And if XML Schemas is not good for validation, does it redeem itself by 
winning over implementers with a good standard?  It is no secret that 
the XML Schemas Structures standard is the very model of an 
impenetrable, guru-inducing standard. But, having worked in the W3C XML 
Schema WG at the time of the first release, and deeply respecting the 
editors and working group members, I believe this is not a fixable fault 
with the documentation, but a reflection of the brain-numbing technology.

I have two personal anecdotes about this. In 2001 I had a contract from 
Manning Press to write a book on XML Schemas, in particular explaining 
the standard. After three months of full-time work on this, I abandoned 
the project and repaid my advances at my loss, because I decided that 
trying to make a silk purse out of a sow's ear would be either 
impossible or irresponsible.  The second anecdote is that when making 
our implementation of XML Schemas (a project initially funded by JSTOR 
which is making its leisurely way towards open source) we twice had 
programmers threaten to resign because working on the XML Schemas 
implementation was too unpleasant. One of these programmers was 
subsequently headhunted by Microsoft and the other is currently working 
on his PhD in Computer Science, so they are not idiots or defeatists; 
and we have a history of high retention rates.

Continuing further looking at the original requirements, we see the 
following purported design principles:

   1. more expressive than XML DTDs;
   2. expressed in XML;
   3. self-describing;
   4. usable by a wide variety of applications that employ XML;
   5. straightforwardly usable on the Internet;
   6. optimized for interoperability;
   7. simple enough to implement with modest design and runtime resources;
   8. coordinated with relevant W3C specs (XML Information Set, Links,
      Namespaces, Pointers, Style and Syntax, as well as DOM, HTML, and
      RDF Schema).

I contend that it is apparent that the changes proposed for XML Schemas 
1.1 do nothing to address the shortfalls in meeting these goals that 
have been a bugbear since XML Schemas 1.0.  In particular, it fails:

     4. see the databinding and related comments above
     5. there is nothing straightforward about XSD, and it is too
        verbose to download
     6. see the databinding and related comments above: it is manifestly
        a disaster for interoperability
     7. XSD is manifestly not simple to implement
     8. the PSVI (post-schema validation infoset) represents a
        fundamental break with the basic relevant XML specs. Indeed, it
        might be said that XML Schemas are not schemas for documents,
        but schemas for databases that have an XML serialization. The
        two are not the same.

So, allowing for the sake of argument that XML Schemas may be so deficient in these 
areas and so complex, can it justify itself as allowing very 
sophisticated document constraints? Clearly the answer is no, certainly 
for Part 1 Structures. The rival language to XML Schemas (i.e. OASIS/ISO 
RELAX NG) is far more powerful, and the alternative (which is also a 
complement) for non-grammar/non-datatype constraints and assertions 
(i.e. ISO Schematron) is far more powerful.

XML Schemas has a very poor bang-per-buck ratio. There are many 
significant classes of document structures it is incapable of being 
useful for: for example, SVG, XSLT.  Indeed, it may be argued that these 
kinds of tricky structures are exactly the kinds of structures most 
calling out for validation.

Finally, if the language is not very good for structural constraints, is 
it at least good for document evolution? The answer here again is no. 
Experience with large schemas has shown that the XML Schemas complex 
type derivation facilities are quite bogus:  the type extension 
mechanism introduces not only an extra concept, but causes a fragile 
base-class-like problem for maintenance. And the type derivation by 
restriction mechanism does not simplify declarations. 
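
The fragile base-class problem mentioned above can be sketched in a few lines of Python. This is only a toy model, not an XSD processor: the class `ComplexType`, the helper `is_valid`, and the `Address`/`USAddress` types are hypothetical names invented for illustration, and content models are reduced to flat sequences of required elements. The one XSD behaviour it relies on is that derivation by extension appends the new content after the base type's content.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ComplexType:
    """Toy model of a complex type whose content is a plain sequence."""
    name: str
    own_elements: List[str]               # elements declared by this type
    base: Optional["ComplexType"] = None  # type being extended, if any

    def content(self) -> List[str]:
        # Derivation by extension: base content first, then the additions.
        inherited = self.base.content() if self.base else []
        return inherited + self.own_elements


def is_valid(children: List[str], ctype: ComplexType) -> bool:
    # Every element is required exactly once, in order (no occurrence ranges).
    return children == ctype.content()


address = ComplexType("Address", ["street", "city"])
us_address = ComplexType("USAddress", ["zip"], base=address)

document = ["street", "city", "zip"]
valid_before = is_valid(document, us_address)

# A maintainer edits only the base type...
address.own_elements.insert(1, "district")

# ...and every derived type silently changes with it.
valid_after = is_valid(document, us_address)
```

Here `valid_before` is true and `valid_after` is false, although nobody touched the declaration of `USAddress`: the meaning of a derived type cannot be read off its own declaration.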

I do have many other specific issues as well, which I won't bore readers 
with: they can be summarized by the comment that XML Schemas 1.1 may 
address the kinds of problems that you might want to validate in 1999, 
but not the kinds of problems found in XML as practised in 2009: for 
example, foreign code lists, and the abandonment of large XML documents 
in favour of either XML-in-ZIP or XML-on-filesystem collections of 
smaller documents linked by URL and other IDs.

I should acknowledge that there are indeed many successful uses of XML 
Schemas. I see no evidence that these successful uses are because of any 
particular excellence in XML Schemas that would not be possible in other 
schema languages.

A proposed solution
---------------------

I therefore ask the TAG to instruct, influence or otherwise encourage 
the XML Schema Working Group to put XSD 1.1 on hold and instead to work 
on a radical relayering into a two-layer model. Some of the XSD 1.1 
changes would make their way into the basic layer, some would make their 
way into the advanced layer which would be equivalent to the proposed 
XSD 1.1.

In concrete terms, I propose this:

1) A radically simpler schema language, compatible as much as possible 
with the current XSD 1.0 syntax, be created. It should have the 
following properties:

     i) It should follow ISO RELAX NG in all relevant design decisions,
        and be trivially translatable to and from RELAX NG.
     ii) In doing so, it should remove as many as possible of the
        patterns identified as problematic for databinding.
     iii) It should have no concept of structural type derivation: no
        extension or restriction of complex types. It need not support
        any simple type derivation or facets, though it would support
        the built-in derived types of XSD.
     iv) It should have no obscure rules such as UPA that are not
        required by RELAX NG.
     v) It should have no constraints or requirements for streamable
        implementation.

 2) A secondary layer which adds:
    
     i) Complex type derivation
     ii) UPA, naming, and other obscure rules
     iii) Features problematic for databinding, and constraints to
        allow streaming validation

The bottom line is that the new simpler language would not be 
type-based, nor would it require 1-unambiguous schemas. Both those 
things, which are currently presented as core to the mechanics of XML 
Schemas, would become additional assertions to be used or checked by the 
full language and its processors.
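
For readers unfamiliar with the term, the following Python sketch shows what a 1-unambiguity check looks like as a separate assertion over a content model. It uses the textbook position-based (Glushkov) construction, not the full UPA rule of XSD, which also has to deal with wildcards, occurrence ranges and substitution groups. The tuple encoding of content models and the function names `analyse` and `is_one_unambiguous` are assumptions made for this illustration.

```python
from itertools import count

# Content models as nested tuples (a hypothetical encoding):
#   ("elem", name) | ("seq", [models]) | ("choice", [models])
#   ("opt", model) | ("star", model)


def analyse(node, follow, names, ids):
    """Return (nullable, first, last) for node; fill follow and names."""
    kind = node[0]
    if kind == "elem":
        pos = next(ids)          # each element occurrence gets its own position
        names[pos] = node[1]
        follow[pos] = set()
        return False, {pos}, {pos}
    if kind in ("opt", "star"):
        _, first, last = analyse(node[1], follow, names, ids)
        if kind == "star":
            for p in last:       # repetition: the end can loop back to the start
                follow[p] |= first
        return True, first, last
    parts = [analyse(child, follow, names, ids) for child in node[1]]
    if kind == "choice":
        nullable = any(n for n, _, _ in parts)
        first = set().union(*(f for _, f, _ in parts))
        last = set().union(*(l for _, _, l in parts))
        return nullable, first, last
    # Sequence: thread first/last/follow through the members in order.
    nullable, first, last = True, set(), set()
    for n, f, l in parts:
        for p in last:
            follow[p] |= f
        if nullable:
            first |= f
        last = (last | l) if n else set(l)
        nullable = nullable and n
    return nullable, first, last


def is_one_unambiguous(model):
    """True if no two competing positions carry the same element name."""
    follow, names = {}, {}
    _, first, _ = analyse(model, follow, names, count())

    def clash(positions):
        seen = [names[p] for p in positions]
        return len(seen) != len(set(seen))

    return not clash(first) and not any(clash(f) for f in follow.values())


# (a, b) | (a, c): on seeing <a>, a processor cannot tell which branch it is in.
ambiguous = ("choice", [("seq", [("elem", "a"), ("elem", "b")]),
                        ("seq", [("elem", "a"), ("elem", "c")])])
# a, (b | c): same language, but every step is determined by the next element.
factored = ("seq", [("elem", "a"),
                    ("choice", [("elem", "b"), ("elem", "c")])])

ambiguous_ok = is_one_unambiguous(ambiguous)
factored_ok = is_one_unambiguous(factored)
```

Both models accept exactly the same documents, yet only the second passes the check. A grammar-based language in the RELAX NG style can accept either form, leaving the check as something a stricter layer may impose.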

There are many details and issues, of course, but I believe this is more 
straightforward than may be thought. In any case, it is necessary to 
bring XML Schemas to its full potential for being useful on the web, 
rather than the hindrance and snare it currently is.  There is a 
misapprehension, in particular, that RELAX NG cannot be used for 
databinding; in fact, the Java API for ODF was created by a databinding 
tool for RELAX NG, so this is hardly

Cheers
Rick Jelliffe

Editor, ISO/IEC 19757-3:2006 Information technology -- Document Schema 
Definition Language (DSDL) -- Part 3: Rule-based validation -- Schematron

Invited expert, ISO/IEC SC34 WG1 Schema languages
Invited expert, ISO/IEC SC34 WG4 Office Open XML
Formerly Australian delegate, ISO/IEC WG 8 (e.g. SC34)
Formerly member (for Academia Sinica), W3C XML Schema WG
Formerly invited expert, W3C I18n SIG
Formerly invited expert, W3C XML WG (e.g. SIG)

Author, The XML & SGML Cookbook, Recipes for Structured Information 
Management,
   Charles Goldfarb series, Prentice Hall, 1998.
Received on Wednesday, 13 May 2009 15:26:24 UTC