A Call for Rapprochement between W3C XSD and ISO DSDL:
A Non-Intrusive Extension Framework for XSD 1.1 to Support Schematron and Beyond

Rick Jelliffe

2006-03-14

This note is a contribution to discussions on adding various kinds of constraint checking to XSD 1.1. The bottom line is a call for a rapprochement between W3C XSD and ISO DSDL: ISO DSDL is not a stalking horse for RELAX NG but should be considered a valuable and primary resource for little languages and approaches for evolving XSD in a positive direction.

ISO 19757 Document Schema Definition Languages (DSDL) is a multi-part standard for standardizing small, narrow-focus schema languages. It is often portrayed as some kind of attempted competitor to XSD, due to OASIS RELAX NG being one part, but it ain't necessarily so. If you consider ISO DSDL parts 3 and on as a series of small schema languages designed to complement any grammar-based schema language without adding to monolithic complexity, then XSD is clearly the primary potential adopter of ISO DSDL.

The current state of affairs with the W3C XSD WG reminds me of the SGML working group at ISO when we were enhancing IS 8879 SGML to encompass XML, which we had recently developed through W3C. One group felt that we just needed to parameterize SGML more, to complicate it, in order to cope with the variations required by XML; another group, which I belonged to, instead felt that layering was the answer: at a certain point it becomes positively harmful to add complications to a base specification. So we added an “additional constraints” link (“SEEALSO”), which allowed the SGML constraints to be extended by an external document: SGML validity remained determinate and unitary, and the additional constraints could be validated as a further, different type of validity (i.e. XML well-formedness).

What is the relevance to XSD? XSD is in the same position as SGML was a decade ago: large, stiflingly difficult to implement, and with a strong requirement not to weaken determinate validity. Similarly, this requirement not to weaken validity is mistakenly opposed to ideas of layering. In fact the reverse is true: layered systems are easier to test, implement and reason about.

I write not only as the developer of Schematron, and former member of both the W3C XSD WG and the ISO DSDL WG, but also as a commercial developer of schema-related products including XSD. It has long been obvious to me that the XSD-inspired dazedness would eventually clear and that calls for schema capabilities beyond or above those of simple grammars would then have their season: the intent of ISO DSDL has been to collect such little languages for early adopters and for general community benefit as the fog/panic/exploration of XSD clears.

Let me be frank, but certainly with no disrespect intended. When XSD was developed, perhaps a majority of the then XSD WG had never actually written a serious DTD or other schema for XML (or SGML). This perhaps ultimately showed itself in a certain obsessional intricacy in some minor areas (nillability, the lack of integration of keys and uniqueness into the type system, extension by suffixation only, elementFormDefault, etc.) at the expense of other major areas (patterns on mixed content, for example). I suspect that probably the majority of the current XSD WG may have never written a constraint-based schema for XML, e.g. with Xlinkit, XCSL, Schematron or even XSLT. One cannot after all be a specialist in everything! The XSD WG may therefore tend to be less aware than grassroots users of the advantages, characteristics and opportunities afforded by Schematron, and therefore relegate it to a few convenient categories such as “co-occurrence constraint.” Schematron was developed in 1999, and has continued in its current popularity solely because of its general-purpose utility, not because of any hype or party spirit: XSD users favour it equally with RELAX NG and DTD users. It is used for detecting conflicting flight plans over Belgium, checking software architecture rules in the USA, checking local government forms in Japan, and checking the conformance of documents to house rules by big-three publishers, including here in Australia.

Executive Summary

1) Support for simple co-occurrence constraints is better done by allowing attributes as particles in content models rather than by using path expressions. Recommend adopting the mechanism used successfully by ISO DSDL Part 2 (RELAX NG).
2) XSD needs an extension mechanism which will allow embedded little languages with constraints required for extended validity. Recommend new PSVI properties such as [extended validity] to support extensibility. Raid and support the ISO DSDL effort for appropriate extension languages.
3) XSD needs a constraint language, regardless of the support for 1) above. It should use the extension mechanism in 2) above. Recommend ISO DSDL Part 3 (Schematron) as a required (or strongly recommended) extension.
4) XSD keys and uniqueness constraints probably should be moved out of Part 1 and re-cast as an extension. I don't suppose this is feasible, for reason of scaring the horses. But it is an example of exactly the kind of schema language that should be an extension. Because key and uniqueness constraints are currently embedded in the Structures specification, there is no XSD-compliant way in which developers can experiment with and evolve new schema languages. The danger of this is that the XSD WG is forced to do armchair speculative development of enhancements (“yeah that sounds good”): a recipe for perpetual premature standardization, inadequate testing, and a sure way to gather dead wood.

Attributes in Content Models

Support for simple co-occurrence constraints is better done by allowing attributes as particles in content models rather than by using path expressions. The approach used by ISO RELAX NG should be adopted: it has proved to be straightforward to implement and easy for users to understand, and it is declarative and streamable. This would require, I believe, no changes to the PSVI.
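To make the idea concrete, here is a minimal sketch of a co-occurrence constraint expressed as a choice between groups that each contain an attribute and an element. The payment vocabulary (payment, method, cardNumber, amount) is invented for illustration, and the Python matcher is a deliberately tiny stand-in for a content-model validator, not how any RELAX NG or XSD processor is implemented.

```python
"""Co-occurrence through attributes in the content model (illustrative sketch).

The same constraint in RELAX NG compact syntax, with hypothetical names:

    element payment {
      (attribute method { "card" }, element cardNumber { text })
      | (attribute method { "cash" }, element amount { text })
    }
"""
import xml.etree.ElementTree as ET

# Each alternative pairs the required attributes with the child elements
# that may accompany them, in order. All names are assumptions.
ALTERNATIVES = [
    ({"method": "card"}, ["cardNumber"]),
    ({"method": "cash"}, ["amount"]),
]


def matches_payment(elem: ET.Element) -> bool:
    """Return True if attributes and children together match one alternative."""
    children = [child.tag for child in elem]
    return any(
        dict(elem.attrib) == attrs and children == kids
        for attrs, kids in ALTERNATIVES
    )


ok = ET.fromstring('<payment method="card"><cardNumber>4111</cardNumber></payment>')
bad = ET.fromstring('<payment method="cash"><cardNumber>4111</cardNumber></payment>')
print(matches_payment(ok))   # True
print(matches_payment(bad))  # False
```

The point of the sketch is that the attribute value and the permitted children are decided together by one declarative choice, with no path expression involved.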

Adopting this does not entail any notion of somehow saying “RELAX NG was right and we were wrong”; on the contrary, interested users would rejoice that XSD was adopting proven technology. RELAX NG adopted this feature late: it was not obvious to James Clark, Murata Makoto and others that it was the correct feature to adopt. (I believe I was one of the first to suggest it.) But it has proven itself. I hope the XSD WG relentlessly refuses to take part in any childish NIH-ism in this: XSD's earliest development was guided by the proven experience with various deployed schema languages. I strongly recommend that the XSD WG discipline itself to adopt proven features from existing deployed schema languages when they are available, as in this case.

Even though I am obviously a fan of XPath, introducing some reduced XPath-based syntax is, I believe, the wrong approach here: while technically feasible, it plays to all XSD's weaknesses. It complicates understanding while only addressing one small area: exactly the kind of “not enough bangs per buck” that XSD is notorious for.

In particular, one of Paul Biron's useful suggestions, to use the streaming subset of XPath to identify a node whose presence provides the condition for the occurrence of an attribute or element, is, I think on reflection, the wrong way to go. First, because the content model enhancement above is simpler, cleaner and roughly equivalent. Second, because it uses paths for what they are OK at, but does not use them where they shine: with value-based predicates and with random access. Third, because the idea that just providing the most basic occurrence constraints will actually satisfy user requirements is wrong-headed: a tokenistic path language will merely temporarily shift the boundary at which user frustration with XSD's power sets in.

Extension Framework

XSD needs an extension mechanism which will allow embedded little languages with constraints required for extended validity.

XSD is notoriously under-layered and complicated to reason about. That even vendors so freely admit the difficulties they have faced in implementing XSD properly should be uppermost in the mind of the XSD WG, I believe; the failure of implementability in XSD must not be quibbled out of, especially given that XSD had a long gestation period that, one would expect, would have made early implementations higher quality than they turned out to be.

So the WG needs to adopt a very different mindset: I am not talking about changing the PSVI approach or type derivation, I am talking of the futility of “valid means valid everywhere in all implementations” when combined with a monolithic architecture. The failure of XSD implementations to provide consistent validity results is to some extent attributable to this monolithic architecture. I suggest that the problem is not with “valid means valid everywhere in all implementations” but with the lack of extensibility in XSD: <xsd:appinfo> is not enough.

My suggestion for an extension mechanism is below. The focus of ISO DSDL (Document Schema Definition Languages) is now to provide a standard library of little languages, suitable for XSD to include or allow by reference. These include ISO DSDL Part 5: DTLL (Datatype Library Language) and ISO DSDL Part 7: CRDL (Character Repertoire Description Language). In case there is any feeling that these are somehow “anti W3C” technologies, I perhaps should note that DTLL was developed by Jeni Tennison, invited expert to the W3C XSLT WG, while CRDL was developed by Martin Duerst, long time head of W3C Internationalization. Indeed, CRDL is based on a technical note at the W3C.

Extension Framework Details

Everywhere that <xsd:appinfo> is allowed, and also at the top level of the schema, a new element <xsd:extension> should be allowed. Its content may be any element in a non-XSD namespace.
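A minimal sketch of how a processor might locate such extension elements follows. Several things here are assumptions for illustration: the sample schema, the choice to look for <xsd:extension> only as a child of <xsd:annotation> or <xsd:schema> (to keep it apart from the existing XSD derivation element of the same name), and the decision to reject content in no namespace.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"
SCH_NS = "http://purl.oclc.org/dsdl/schematron"

# Hypothetical schema: <xsd:extension> here is the element proposed in this
# note, placed beside <xsd:appinfo>; it is not part of XSD 1.0.
SCHEMA = f"""
<xsd:schema xmlns:xsd="{XSD_NS}" xmlns:sch="{SCH_NS}">
  <xsd:element name="book">
    <xsd:annotation>
      <xsd:appinfo>tool-specific hints</xsd:appinfo>
      <xsd:extension>
        <sch:rule context=".">
          <sch:assert test="chapter">A book needs at least one chapter.</sch:assert>
        </sch:rule>
      </xsd:extension>
    </xsd:annotation>
  </xsd:element>
</xsd:schema>
"""


def split_tag(tag):
    """Split ElementTree's '{namespace}local' form into (namespace, local)."""
    if tag.startswith("{"):
        ns, _, local = tag[1:].partition("}")
        return ns, local
    return "", tag


def extension_children(schema_root):
    """Return (namespace, local name) for each child of each extension element."""
    hosts = (f"{{{XSD_NS}}}annotation", f"{{{XSD_NS}}}schema")
    found = []
    for parent in schema_root.iter():
        if parent.tag not in hosts:
            continue
        for ext in parent.findall(f"{{{XSD_NS}}}extension"):
            for child in ext:
                ns, local = split_tag(child.tag)
                if ns in ("", XSD_NS):
                    raise ValueError(f"{local}: extension content must be in a non-XSD namespace")
                found.append((ns, local))
    return found


root = ET.fromstring(SCHEMA)
print(extension_children(root))
```

An implementation that does not understand any extension can skip these elements entirely, exactly as it already skips the content of <xsd:appinfo>.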

Attributes:

[Extended validation attempted] (Assessment Outcome (Attribute))
[Extended validity] (Assessment Outcome (Attribute))
[Extended validity diagnostics] (Assessment Outcome (Attribute))

Elements:

[Extended validation attempted] (Assessment Outcome (Element))
[Extended validity] (Assessment Outcome (Element))
[Extended validity diagnostics] (Assessment Outcome (Element))

Document Root:

[Extended validation attempted] (Assessment Outcome (Document Root))
[Extended validity] (Assessment Outcome (Document Root))
[Extended validity diagnostics] (Assessment Outcome (Document Root))

[Extended validity] is defined as [validity] plus successful use of all elements in relevant extension elements. Importantly, this gives implementations an on-ramp: they can keep their current notions of validity by allowing but ignoring all extensions. However, extended validation (which should become the new default for implementations to support) fails if either there is an extension element in an unknown namespace (i.e. one that the schema implementation does not support) or validation against the constraints in an extension fails. This satisfies the important objection to “optional” validation: extended validity always means extended validity.

Note that it is a design requirement in XSD that [Validity] can be assessed in a single-pass streaming fashion. It is not a design requirement that [Extended Validity] can be assessed in this manner. This split commends a layered approach.

Note that the extended validity of the Document Root refers to outcomes of validating extensions defined on the document root, and should not be confused with “the validity of the document”.

[Extended validation attempted] gives a list of the namespaces of the children of the relevant extension elements, which provide keys for different kinds of extended validation.

[Extended validity diagnostics] are lists of [namespace, text] pairs, which provide the namespace of the extension coupled with a human-readable text message, for example as generated dynamically by Schematron. (Note: the PSVI extensions do not limit the ability of an API to report other information from schemas for various uses, or to perform different kinds of non-standard validations.)

The presence of these extra PSVI items is the key to extensibility. I don't believe any “required-extension” mechanism is needed or warranted.
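The rules above for the three properties can be sketched as a small function. This is only an illustration under stated assumptions: the ExtendedOutcome record, the assess_extended function and the "checker" callables (one per supported extension namespace, each returning the messages of failed constraints) are hypothetical names, not part of any XSD API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a checker runs one extension language's constraints
# on the current item and returns a message for each failed constraint.
Checker = Callable[[], List[str]]


@dataclass
class ExtendedOutcome:
    attempted: List[str]   # [Extended validation attempted]
    valid: bool            # [Extended validity]
    diagnostics: List[Tuple[str, str]] = field(default_factory=list)  # [Extended validity diagnostics]


def assess_extended(base_valid: bool,
                    extension_namespaces: List[str],
                    checkers: Dict[str, Checker]) -> ExtendedOutcome:
    """Combine ordinary [validity] with the results of every extension."""
    valid = base_valid
    diagnostics: List[Tuple[str, str]] = []
    for ns in extension_namespaces:
        checker = checkers.get(ns)
        if checker is None:
            # An extension in an unsupported namespace makes extended validation fail.
            valid = False
            diagnostics.append((ns, "unsupported extension namespace"))
            continue
        for message in checker():
            valid = False
            diagnostics.append((ns, message))
    return ExtendedOutcome(list(extension_namespaces), valid, diagnostics)


SCH = "http://purl.oclc.org/dsdl/schematron"
OTHER = "http://example.org/unknown-language"  # made-up namespace

passing = assess_extended(True, [SCH], {SCH: lambda: []})
failing = assess_extended(True, [SCH, OTHER], {SCH: lambda: []})
print(passing.valid)        # True
print(failing.valid)        # False
print(failing.diagnostics)
```

Ordinary [validity] is an input here and is never recomputed, which is the layering being argued for: an implementation that ignores extensions simply never calls this function.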

Schematron as a Required Extension

XSD 1.1 should define ISO Schematron as a required or strongly recommended extension.

To some extent, attempting to cover all important bases with exhaustive declarative enhancements to XSD becomes an exercise in tail-chasing: even if XSD is extended with a dozen new co-occurrence constraint elements, there will still be the need for a general purpose constraint language. And, indeed, the best way to determine which constraints should be generalized into some first-class property in XSD is to first provide a general purpose constraint language like Schematron to gather information and increase user and WG expertise.

The subset of Schematron used conforms to ISO 19757-3 Information technology — Document Schema Definition Languages (DSDL) — Part 3: Rule-based validation — Schematron (2006) Annex F: Use of Schematron as a Vocabulary. The namespace used is http://purl.oclc.org/dsdl/schematron

The following effective DTD gives the required elements and attributes of the subset. ISO Schematron defines other elements: it is an error for them to be present. ISO Schematron defines other attributes: it is not an error for these to be present; they may be ignored.

<!-- A rule: optional variable bindings followed by one or more tests -->
<!ELEMENT sch:rule (sch:let*, (sch:assert | sch:report)+)>
<!ATTLIST sch:rule
   context (.) #FIXED '.'
   id CDATA #IMPLIED>
<!ELEMENT sch:let EMPTY>
<!ATTLIST sch:let
   name  CDATA #REQUIRED
   value CDATA #REQUIRED>
<!-- Tests: mixed content provides the human-readable message -->
<!ELEMENT sch:assert (#PCDATA | sch:span | sch:emph | sch:dir | sch:name | sch:value-of)*>
<!ATTLIST sch:assert
   test CDATA #REQUIRED>
<!ELEMENT sch:report (#PCDATA | sch:span | sch:emph | sch:dir | sch:name | sch:value-of)*>
<!ATTLIST sch:report
   test CDATA #REQUIRED>
<!-- Inline elements allowed inside messages -->
<!ELEMENT sch:span (#PCDATA)>
<!ELEMENT sch:emph (#PCDATA)>
<!ELEMENT sch:dir (#PCDATA)>
<!ELEMENT sch:name EMPTY>
<!ATTLIST sch:name
   select CDATA #IMPLIED>
<!ELEMENT sch:value-of EMPTY>
<!ATTLIST sch:value-of
   select CDATA #IMPLIED>
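As a rough illustration of what the subset permits, the sketch below checks a single sch:rule against the structure the effective DTD describes. The subset_errors function and its messages are hypothetical, and the check is structural only: it does not evaluate the test expressions.

```python
import xml.etree.ElementTree as ET

SCH_NS = "http://purl.oclc.org/dsdl/schematron"


def q(name):
    """Qualified name in ElementTree's '{namespace}local' form."""
    return f"{{{SCH_NS}}}{name}"


INLINE = {q(n) for n in ("span", "emph", "dir", "name", "value-of")}


def subset_errors(rule):
    """Return a list of ways the rule departs from the subset (empty if none)."""
    if rule.tag != q("rule"):
        return [f"expected sch:rule, found {rule.tag}"]
    errors = []
    if rule.get("context", ".") != ".":
        errors.append("context must be '.'")
    seen_test = False
    for child in rule:
        if child.tag == q("let"):
            if seen_test:
                errors.append("sch:let must precede sch:assert and sch:report")
            for attr in ("name", "value"):
                if child.get(attr) is None:
                    errors.append(f"sch:let requires a {attr} attribute")
        elif child.tag in (q("assert"), q("report")):
            seen_test = True
            if child.get("test") is None:
                errors.append("sch:assert and sch:report require a test attribute")
            for inline in child:
                if inline.tag not in INLINE:
                    errors.append(f"{inline.tag} is not allowed in a message")
        else:
            # Other elements, including other ISO Schematron elements, are errors.
            errors.append(f"{child.tag} is not allowed in sch:rule")
    if not seen_test:
        errors.append("sch:rule needs at least one sch:assert or sch:report")
    return errors


GOOD = (f'<sch:rule xmlns:sch="{SCH_NS}" context=".">'
        '<sch:assert test="book">The top-level element must be book.</sch:assert>'
        '</sch:rule>')
BAD = (f'<sch:rule xmlns:sch="{SCH_NS}" context="chapter">'
       '<sch:extends rule="x"/></sch:rule>')

print(subset_errors(ET.fromstring(GOOD)))       # []
print(len(subset_errors(ET.fromstring(BAD))))   # 3
```

Unknown attributes are deliberately not reported, matching the statement that other ISO Schematron attributes may be present and ignored.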

Note that in this subset:

The context attribute is restricted to be “.”. For an <extension> element that appears at the top level of a schema rather than in a content model, the context is “/”, the document root node (not the root element). For example, this allows constraining the top-level element of any document to a certain range.
No special presentation processing is required for the text of elements span, emph and dir in the PSVI.
The element sch:name should be resolved to the QName of the local element or attribute (or type, if that is ever possible).
Phases, diagnostics, patterns, abstract rules and abstract patterns are not part of the subset defined.
The path expression in the test attribute is interpreted as a boolean expression; it may not resolve to a particular type or node. For simple co-occurrence constraints, use the extended content models described above.
The path expressions are interpreted as if they are type-aware XPath 2.0 path expressions. If an implementation can only handle some simpler subset, such as XPath 1.0, the implementation fails with an error at run time.
The path expressions may require more than streaming access. This is one issue which sets apart simple [validity] from [extended validity].
For other semantics, see the ISO Schematron spec, e.g. at http://www.schematron.com/

I would like to stress that the provision of Schematron in extension elements reduces lock-in. At some future stage, some bright people unknown could come up with some better system as yet undreamed of. At that time, the XSD WG can then adopt the new constraint system as the required extension, and obsolete Schematron. Compare this with the difficulty in, say, adding a new facet or changing the key and uniqueness constraints in monolithic XSD 1.0.

The provision of Schematron simplifies the task of XSD enhancement, because it gives a plausible workaround for rejected requirements to users. For example, a user who wants to specify that the top-level element must be “book”
