W3C

[Editorial Draft] Extending and Versioning XML Languages Part 2: Schema Languages

Draft TAG Finding 24 November 2004

This version:
http://www.w3.org/2001/tag/doc/versioning-20041124
Latest version:
http://www.w3.org/2001/tag/doc/versioning.html
Previous version:
http://www.w3.org/2001/tag/doc/versioning-20031003
Editors:
David Orchard, BEA Systems, Inc. <David.Orchard@BEA.com>
Norman Walsh, Sun Microsystems, Inc. <Norman.Walsh@Sun.COM>

Abstract

@@rewrite

This document is the requested breakout of the schema-language-specific discussion of extensibility. It is heavily XML Schema based, but only because of scheduling. OWL/RDF and RELAX NG sections will be added.

Status of this Document

This document has been developed for discussion by the W3C Technical Architecture Group. It does not yet represent the consensus opinion of the TAG.

Publication of this finding does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time.

Additional TAG findings, both approved and in draft state, may also be available. The TAG expects to incorporate this and other findings into a Web Architecture Document that will be published according to the process of the W3C Recommendation Track.

Please send comments on this finding to the publicly archived TAG mailing list www-tag@w3.org (archive).

Table of Contents

1 Introduction
2 Version Identification Strategies
    2.1 Version Strategy: all components in new namespace(s) for each version (#1)
    2.2 Version Strategy: all new components in new namespace(s) for each compatible version (#2)
    2.3 Version Strategy: All new components in existing or new namespace(s) for each compatible version (#3)
    2.4 Version Strategy: all new components in existing or new namespace(s) for each version and a version identifier (#4)
3 Indicating Incompatible changes
    3.1 Must Understand
    3.2 Type extension
    3.3 Substitution Groups
4 Determinism
5 Other technologies
6 Conclusion
7 References
8 Acknowledgements


1 Introduction

Extending and Versioning XML Languages Part 1 described extending and versioning XML with little focus on the schema language used. Part 2 focuses on the schema-language-specific aspects of extending and versioning XML. The choices, decisions, and strategies described in Part 1 are augmented with schema instances herein.

2 Version Identification Strategies

2.1 Version Strategy: all components in new namespace(s) for each version (#1)

Using XML Schema, the name owner might like to write a schema such as:

The next version of the schema, with middle name added, might look like

This schema is illegal because the middle element in the second name namespace and the wildcard with ##other are non-deterministic: the wildcard also admits the middle element. Non-determinism is discussed further under the next strategy. An alternative is to omit the wildcard altogether and rely upon subtyping for extension. But this prevents any kind of compatible evolution, as both sides must have the new schema to understand the type. The language designer has to choose between allowing compatible extensibility/versioning or incompatible extensibility when subtyping is used.
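The overlap between the ##other wildcard and an element from a second namespace can be checked mechanically. The following is a minimal sketch, assuming XML Schema 1.0 wildcard semantics; the function name and both namespace URIs are hypothetical and do not come from this Finding.

```python
from typing import Optional


def wildcard_admits(constraint: str,
                    target_ns: Optional[str],
                    element_ns: Optional[str]) -> bool:
    """Report whether an xs:any with this namespace attribute admits an element.

    constraint is the raw attribute value: "##any", "##other", or a
    whitespace-separated list of URIs, "##targetNamespace" and "##local".
    None stands for "no namespace".
    """
    if constraint == "##any":
        return True
    if constraint == "##other":
        # XML Schema 1.0: any qualified element outside the target namespace.
        return element_ns is not None and element_ns != target_ns
    for token in constraint.split():
        if token == "##targetNamespace":
            if element_ns == target_ns:
                return True
        elif token == "##local":
            if element_ns is None:
                return True
        elif token == element_ns:
            return True
    return False


NAME_V1 = "http://example.org/name/1"  # hypothetical version 1 namespace
NAME_V2 = "http://example.org/name/2"  # hypothetical version 2 namespace

# The version 1 type ends with a ##other wildcard. A middle element declared
# in the version 2 namespace is admitted by that wildcard as well.
overlap = wildcard_admits("##other", NAME_V1, NAME_V2)
print(overlap)  # True
```

Because the same middle element can be attributed either to the wildcard or to the new declaration, a type that extends the version 1 type cannot add it after the wildcard.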

Because of the type extension determinism problem, the language designer cannot re-use the existing name definition. They must create a new schema without any reference to the previous schema.

Using a new namespace for all components does not allow compatible evolution by the language designer, unless they choose to put only the new components in a new namespace, which is strategy #2. Additionally, the version 2 schema cannot re-use the existing type definition.

2.2 Version Strategy: all new components in new namespace(s) for each compatible version (#2)

We previously saw how re-use by importing and extending schemas with wildcards is not possible. In this strategy, the schema designer attempts to insert the new extension in the existing schema definition, like:

However, the determinism constraint of XML Schema, described in more detail later, prevents this from working. The problem arises in any version in which an optional element is followed by a wildcard. In this example, this occurs when an optional element is added and extensibility is still desired. This is an ungentle introduction to the difference between extensibility and versioning. An optional middle name added in a subsequent version is a good example. Consumers should be able to continue processing if they don't understand an additional optional middle name, and we want to keep the extensibility point in the new version. We can't write a schema that contains both the optional middle name and a wildcard for extensibility. The previous schema is roughly what is desired using wildcards, but it is illegal because of the determinism constraint.

The author has 4 options for the v2 schema for name and middle, listed below and detailed subsequently:

  1. optional middle, extensibility retained, but name type does not refer to middle;

  2. optional middle, extensibility is lost, name type refers to middle;

  3. required middle, extensibility retained, name type refers to middle but compatibility is lost (essentially strategy #1);

  4. no update to the Schema

If they leave the middle as optional and retain the extensibility point, the best schema that they can write is:

This is not a very helpful XML Schema change. The problem is that they cannot insert the reference to the optional mid:middle element in the name schema and retain the extensibility point because of the aforementioned Non-Determinism Constraint.

The core of the problem is that there is no mechanism for constraining the content of a wildcard. For example, imagine that ns1 contains foo and bar. It is not possible to take the SOAP schema (an example of a schema with a wildcard) and require that the ns1:foo element must be a child of the header element and ns1:bar must not be a child of the header element using just W3C XML Schema constructs. Indeed, the need for this functionality spawned some of the WSDL functionality.
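Since the schema cannot express such a constraint, it has to be enforced by application code layered on top of validation. A minimal sketch of that extra check follows, assuming the SOAP 1.2 envelope namespace; the ns1 URI, the function name and the sample messages are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

SOAP = "http://www.w3.org/2003/05/soap-envelope"
NS1 = "http://example.org/ns1"  # hypothetical namespace containing foo and bar


def check_header(envelope_xml: str) -> List[str]:
    """Apply the rule the wildcard cannot state: foo required, bar forbidden."""
    root = ET.fromstring(envelope_xml)
    header = root.find(f"{{{SOAP}}}Header")
    # Compare with None explicitly: an element without children is falsy.
    children = [] if header is None else [child.tag for child in header]
    problems = []
    if f"{{{NS1}}}foo" not in children:
        problems.append("ns1:foo is required in the header")
    if f"{{{NS1}}}bar" in children:
        problems.append("ns1:bar is not allowed in the header")
    return problems


GOOD = (
    f'<env:Envelope xmlns:env="{SOAP}" xmlns:ns1="{NS1}">'
    "<env:Header><ns1:foo/></env:Header><env:Body/>"
    "</env:Envelope>"
)
BAD = (
    f'<env:Envelope xmlns:env="{SOAP}" xmlns:ns1="{NS1}">'
    "<env:Header><ns1:bar/></env:Header><env:Body/>"
    "</env:Envelope>"
)

print(check_header(GOOD))  # []
print(check_header(BAD))
```

Both messages are acceptable to a header wildcard that admits ns1; only the additional code distinguishes them.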

They could decide to lose the extensibility point (option #2), such as

This does lose the possibility for forwards-compatible evolution.

The final option, #3, is adding a required middle. They must indicate the change is incompatible. A new namespace name for the name element can be created. This is essentially strategy #1, new namespace for all components.

The downsides of the 3 options for new components in new namespace name(s) design have been described. Additionally, the design can result in specifications and namespaces that are inappropriately factored, as related constructs will be in separate namespaces.

2.3 Version Strategy: All new components in existing or new namespace(s) for each compatible version (#3)

It is possible to create Schemas with additional optional components. This requires re-using the namespace name for optional components and special schema design techniques. The re-using namespace rule is:

Good Practice

Re-use namespace names Rule: If a backwards compatible change can be made to a specification, then the old namespace name SHOULD be used in conjunction with XML's extensibility model.

It is important to note that a new namespace name is not required whenever a specification evolves, as it is in strategies #1 and #2; rather, a new namespace name is required only if an incompatible change is made. Strategy #1 uses a new namespace for all existing components and any additions. Strategy #2 uses a new namespace for all additions. Strategy #3 re-uses namespaces for compatible extensions.

Good Practice

New namespaces to break Rule: A new namespace name is used when backwards compatibility is not permitted, that is, software MUST break if it does not understand the new language components.

Earlier examples showed that it is not possible to have a wildcard with ##any (or even ##targetNamespace) following optional elements in the target namespace. The solution to this problem is to introduce an element in the schema that will always appear if the extension appears. The content model of the extensibility point is that element plus the extension. There are two styles for this. The first was published in an earlier version of this Finding in December 2003. It uses an Extensibility element with the extensions nested inside. The second was published in July 2004, then updated on MSDN. It uses a Sentry or Marker element with extensions following it.

A name type with extension elements is

Because each extension in the target namespace is inside an Extension element, each subsequent target namespace extension increases nesting by another layer. While this layer of nesting per extension is not desirable, it is what can be accomplished today when applying strict XML Schema validation. It seems, at least to this author, that potentially having multiple nested elements is worthwhile if multiple compatible revisions can be made to a language. This technique allows validation of extensions in the target namespace while retaining validation of the target namespace itself.

The previous schema allows the following sample name:

The namespace author can create a schema for this type

The advantage of this design technique is that a forwards and backwards compatible Schema V2 can be written. The V2 schema can validate documents with or without the middle, and the V1 schema can validate documents with or without the middle.
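The effect on consumers can be sketched as follows. This is an illustration only: the namespace URI, the sample names and the element order are assumptions, with the optional middle nested inside an Extension element as described above.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Hypothetical namespace, used for both versions of the language.
NS = {"n": "http://example.org/name"}

WITH_MIDDLE = (
    '<name xmlns="http://example.org/name">'
    "<first>Jane</first><last>Doe</last>"
    "<Extension><middle>Q</middle></Extension>"
    "</name>"
)
WITHOUT_MIDDLE = (
    '<name xmlns="http://example.org/name">'
    "<first>Jane</first><last>Doe</last>"
    "</name>"
)


def read_v1(xml_text: str) -> Dict[str, Optional[str]]:
    """A version 1 consumer: reads what it knows and skips Extension."""
    root = ET.fromstring(xml_text)
    return {
        "first": root.findtext("n:first", namespaces=NS),
        "last": root.findtext("n:last", namespaces=NS),
    }


def read_v2(xml_text: str) -> Dict[str, Optional[str]]:
    """A version 2 consumer: also looks inside Extension for the middle."""
    root = ET.fromstring(xml_text)
    name = read_v1(xml_text)
    name["middle"] = root.findtext("n:Extension/n:middle", namespaces=NS)
    return name


print(read_v1(WITH_MIDDLE))     # the version 1 consumer ignores the middle
print(read_v2(WITH_MIDDLE))     # the version 2 consumer picks it up
print(read_v2(WITHOUT_MIDDLE))  # middle is None for a version 1 document
```

Both consumers accept both documents, which is the forwards and backwards compatibility the technique is meant to provide.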

Further, the re-use of the same namespace has better tooling support. Many applications use a single schema to create the equivalent programming constructs. These tools often work best with single namespace support for the "generated" constructs. The re-use of the namespace name allows at least the namespace author to make changes to the namespace and perform validation of the extensions.

An obvious downside of this approach is the complexity of the schema design. Another downside is that changes are linear, so 2 potentially parallel extensions must be nested rather than parallel.

2.4 Version Strategy: all new components in existing or new namespace(s) for each version and a version identifier (#4)

Using a version identifier, the name instances would change to show the version of the name they use, such as:

The last example shows that the middle is now a mandatory part of the name. As with Design #2, the schema for the optional middle cannot fully express the content model. A schema for the mandatory middle is

A significant downside with using version identifiers is that software that supports both versions of the name must perform special processing on top of XML and namespaces. For example, many components "bind" XML types into particular programming language types. Custom software must process the version attribute before using any of the "binding" software. In Web services, toolkits often take SOAP body content, parse it into types and invoke methods on the types. There are rarely "hooks" for the custom code to intercept processing between the "SOAP" processing and the "name" processing. Further, if version attributes are used by any 3rd party extensions (say mid:middle has a version), then the schema cannot refer to the correct middle.
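The special processing described above amounts to a dispatch step that runs before any binding code. The sketch below is hypothetical throughout: the attribute name version, its values, the namespace URI and the choice to keep middle in the existing namespace are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

NS = {"n": "http://example.org/name"}  # hypothetical namespace


def bind(root: ET.Element, middle_required: bool) -> Dict[str, Optional[str]]:
    """Stand-in for generated binding code for one version of the name."""
    middle = root.findtext("n:middle", namespaces=NS)
    if middle_required and middle is None:
        raise ValueError("version 2.0 names must carry a middle element")
    return {
        "first": root.findtext("n:first", namespaces=NS),
        "middle": middle,
        "last": root.findtext("n:last", namespaces=NS),
    }


def parse_name(xml_text: str) -> Dict[str, Optional[str]]:
    """Custom step: inspect the version attribute, then choose the binding."""
    root = ET.fromstring(xml_text)
    version = root.get("version")
    if version == "1.0":
        return bind(root, middle_required=False)
    if version == "2.0":
        return bind(root, middle_required=True)
    raise ValueError(f"unsupported name version: {version!r}")


V1_DOC = (
    '<name xmlns="http://example.org/name" version="1.0">'
    "<first>Jane</first><last>Doe</last></name>"
)
V2_DOC = (
    '<name xmlns="http://example.org/name" version="2.0">'
    "<first>Jane</first><middle>Q</middle><last>Doe</last></name>"
)

print(parse_name(V1_DOC))
print(parse_name(V2_DOC))
```

The namespace name is the same in both documents, so nothing in XML or namespace processing selects the binding; only the hand-written parse_name step does.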

3 Indicating Incompatible changes

3.3 Substitution Groups

Another mechanism for extending a type in XML Schema is substitution groups. Substitution groups enable an element to be declared as substitutable for another. This can only be used for incompatible extensions as the consumer must understand the substitution type. Substitution groups require that elements are available for substitution, so the name designer must have provided a name element in addition to the name type.

Substitution groups do allow a single extension author to indicate that their changes are mandatory. The limitation is that the extension author has now taken over the type's extensibility. A visual way of imagining this is that the type tree has now been moved from the language designer over to the extension author. And the language designer probably does not want their type to be "hijacked".

However, this is not substantially different from an extension being marked with a "Must Understand". In either case, with the extensions higher up in the tree (sometimes called top-typing) or lower in the tree (bottom-typing), a new type is effectively created.

The difference is that there can only be 1 element at the top of an element hierarchy. If multiple mandatory extensions are added, then the only way to compose them together is at the bottom of the type because that is where the extensibility is.

Substitution groups do not allow a language designer and an extension author to incompatibly change the language as they end up conflicting over what to call the name element. Thus substitution groups are a poor mechanism for allowing an extension author to indicate that their changes are incompatible. A Must Understand flag is a superior method because it allows multiple extension authors to mix their mandatory extensions with a language designer's versioning strategy. Hence language designers should prevent substitution groups and provide a Must Understand flag or other model when they wish to allow 3rd parties to make incompatible changes.
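A consumer honouring a Must Understand flag might behave as in the following sketch. It is modelled loosely on the SOAP mustUnderstand attribute, but the attribute, both namespace URIs and the function names are hypothetical and are not defined by this Finding.

```python
import xml.etree.ElementTree as ET
from typing import List, Set, Tuple

NAME_NS = "http://example.org/name"   # hypothetical language namespace
MID_NS = "http://example.org/middle"  # hypothetical 3rd party namespace
MUST_UNDERSTAND = f"{{{NAME_NS}}}mustUnderstand"  # hypothetical flag


def split_tag(tag: str) -> Tuple[str, str]:
    """Split ElementTree's '{namespace}local' form into its two parts."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag


def process(xml_text: str, understood: Set[str]) -> List[str]:
    """Return local names of understood children; reject flagged unknowns."""
    accepted = []
    for child in ET.fromstring(xml_text):
        namespace, local = split_tag(child.tag)
        if namespace in understood:
            accepted.append(local)
        elif child.get(MUST_UNDERSTAND) == "true":
            raise ValueError(f"mandatory extension not understood: {local}")
        # Otherwise the unknown extension is ignored.
    return accepted


HEAD = f'<name xmlns="{NAME_NS}" xmlns:n="{NAME_NS}" xmlns:mid="{MID_NS}">'
OPTIONAL_DOC = (
    HEAD + "<first>Jane</first><mid:middle>Q</mid:middle>"
    "<last>Doe</last></name>"
)
MANDATORY_DOC = (
    HEAD + '<first>Jane</first><mid:middle n:mustUnderstand="true">Q'
    "</mid:middle><last>Doe</last></name>"
)

print(process(OPTIONAL_DOC, {NAME_NS}))          # ['first', 'last']
print(process(MANDATORY_DOC, {NAME_NS, MID_NS}))  # ['first', 'middle', 'last']
```

Calling process(MANDATORY_DOC, {NAME_NS}) raises ValueError, which is how a consumer that does not know the extension is forced to fail. Any number of extension authors can flag their own elements this way without renaming the name element.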

In some cases, a language does not provide a Must Understand mechanism. In the absence of a Must Understand model, the only way to force consumers to reject a message if they don't understand the extension namespace is to change the namespace name of the root element, but this is rarely desirable.

4 Determinism

This Finding has devoted considerable space to deterministic content models, so it is worth describing the W3C XML Schema determinism rules in more detail. The reader is reminded that these rules are specific to XML DTDs and W3C XML Schema; other XML schema languages, such as RELAX NG, do not use them and so do not suffer from the contortions one is forced through when using W3C XML Schema. XML DTDs and W3C XML Schema have a rule that requires schemas to have deterministic content models. From the XML 1.0 specification:

"For example, the content model ((b, c) | (b, d)) is non-deterministic, because given an initial b the XML processor cannot know which b in the model is being matched without looking ahead to see which element follows the b."
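The quoted example can be reproduced with a very small check. This is a simplified sketch, assuming a choice whose branches are plain sequences of required elements; the function name is hypothetical, and real determinism checking has to handle much more than this.

```python
from collections import Counter
from typing import List, Sequence


def ambiguous_starts(branches: Sequence[Sequence[str]]) -> List[str]:
    """Element names that begin more than one branch of a choice."""
    counts = Counter(branch[0] for branch in branches if branch)
    return sorted(name for name, seen in counts.items() if seen > 1)


# ((b, c) | (b, d)): both branches start with b, so an initial b is ambiguous.
print(ambiguous_starts([("b", "c"), ("b", "d")]))  # ['b']

# (b, (c | d)) accepts the same documents; its inner choice is unambiguous.
print(ambiguous_starts([("c",), ("d",)]))  # []
```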

The use of ##any means there are some schemas that we might like to express, but that aren't allowed.

Good Practice

Be Deterministic rule: Use of wildcards MUST be deterministic. Location of wildcards, namespace of wildcard extensions, minOccurs and maxOccurs values are constrained, and type restriction is controlled.

As shown earlier, a common design pattern is to provide an extensibility point (not an element) allowing any namespace at the end of a type. This is typically done with <xs:any namespace="##any">.

Determinism makes this unworkable as a complete solution in many cases. Firstly, the extensibility point can only occur after required elements in the original schema, limiting the scope of extensibility in the original schema. Secondly, backwards compatible changes require that the added element is optional, which means a minOccurs="0". Determinism prevents us from placing a minOccurs="0" before an extensibility point of ##any. Thus, when adding an element at an extensibility point, the author can make the element optional and lose the extensibility point, or the author can make the element required and lose backwards compatibility.
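The trade-off can be illustrated with a toy determinism check. The sketch below is a deliberate simplification and not the W3C XML Schema algorithm: it handles only a flat sequence, treats every wildcard as ##any, and supports only minOccurs of 0 or 1 and maxOccurs of 1 or unbounded. All names in it are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

ANY = "*"  # stands in for <xs:any namespace="##any">


@dataclass(frozen=True)
class Particle:
    name: str               # element name, or ANY for a wildcard
    optional: bool = False  # minOccurs="0"
    repeats: bool = False   # maxOccurs="unbounded"


def overlaps(p: Particle, q: Particle) -> bool:
    """Two particles overlap if some element could match either of them."""
    return p.name == ANY or q.name == ANY or p.name == q.name


def competing(particles: Sequence[Particle], matched: int) -> List[int]:
    """Particles that may match the next element after particle `matched`.

    matched == -1 means that nothing has been matched yet.
    """
    found = []
    if matched >= 0 and particles[matched].repeats:
        found.append(matched)
    for j in range(matched + 1, len(particles)):
        found.append(j)
        if not particles[j].optional:
            break  # a required particle cannot be skipped
    return found


def conflicts(particles: Sequence[Particle]) -> List[Tuple[int, int]]:
    """Index pairs of overlapping particles that compete for one element."""
    result = set()
    for matched in range(-1, len(particles)):
        candidates = competing(particles, matched)
        for a, i in enumerate(candidates):
            for j in candidates[a + 1:]:
                if overlaps(particles[i], particles[j]):
                    result.add((i, j))
    return sorted(result)


first, last = Particle("first"), Particle("last")
wildcard = Particle(ANY, optional=True, repeats=True)

V1 = [first, last, wildcard]
V2_DESIRED = [first, last, Particle("middle", optional=True), wildcard]
V2_REQUIRED = [first, last, Particle("middle"), wildcard]
V2_NO_WILDCARD = [first, last, Particle("middle", optional=True)]

print(conflicts(V1))              # []
print(conflicts(V2_DESIRED))      # [(2, 3)]
print(conflicts(V2_REQUIRED))     # []
print(conflicts(V2_NO_WILDCARD))  # []
```

Under these assumptions only the desired version 2 model is rejected: after last, a middle element could be attributed either to the optional middle declaration or to the wildcard. Making middle required, or dropping the wildcard, removes the conflict at the cost described above.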

5 Other technologies

The W3C XML Schema Working Group has heard and taken to heart many of these concerns. They have plans to remedy some of these issues in XML Schema 1.1 [21]. They are currently looking at a "weak wildcard" model, which solves some but not all of the problems. There is no public Working Draft of a Schema 1.1 with improved extensibility or versioning at the time of writing this Finding.

A simple analysis of doing compatible extensibility and versioning using RDF and OWL is available [21]. In general, RDF and OWL offer superior mechanisms for extensibility and versioning. RDF and OWL explicitly allow extension components to be added to components. Further, the RDF and OWL model builds in the notion of "Must Ignore Unknowns", as an RDF/OWL processor will absorb the extra components but do nothing with them. An extension author can require that consumers understand the extension by changing the type using a type extension mechanism.

RELAX NG is another schema language. It explicitly allows extension components to be added to other components as it does not have the non-determinism constraint.

6 Conclusion

This Finding describes a number of questions, decisions and rules for using XML, W3C XML Schema, and XML Namespaces in language construction and extension. The main goal of the set of rules is to allow language designers to know their options for language design, and ideally make backwards- and forwards-compatible changes to their languages to achieve loose coupling between systems.

7 References

FOLDOC
Free Online Dictionary of Computing. (See http://wombat.doc.ic.ac.uk/foldoc/.)
FlexXMLP
Flexible XML Processing Profile. (See http://www.upnp.org/download/draft-goland-fxpp-01.txt.)
MIME
RFC 1521, MIME. (See http://www.ietf.org/rfc/rfc1521.txt.)
HTML 2.0
RFC 1866, HTML 2.0. (See http://www.ietf.org/rfc/rfc1866.txt.)
WebDAV XMLIgnore post
Yaron Goland. XML Ignore proposed for WebDAV. (See http://lists.w3.org/Archives/Public/w3c-dist-auth/1997AprJun/0190.html.)
WebDAV
RFC 2518, WebDAV (See http://www.ietf.org/rfc/rfc2518.txt.)
HTML 4.0
HTML 4.0. (See http://www.w3.org/TR/1998/REC-html40-19980424/.)
TBL Mandatory Extensions
Berners-Lee. Web Architecture: Mandatory extensions. (See http://www.w3.org/DesignIssues/Mandatory.html.)
TBL Extensible languages
Berners-Lee. Web Architecture: Extensible languages. (See http://www.w3.org/DesignIssues/Extensible.html.)
TBL Evolution
Berners-Lee. Web Architecture: Evolvability. (See http://www.w3.org/DesignIssues/Evolution.html.)
Web Architecture: Extensible Languages
Berners-Lee and Connolly, ed. Web Architecture: Extensible Languages World Wide Web Consortium, 1998. (See http://www.w3.org/TR/1998/NOTE-webarch-extlang-19980210.)
HTML Document types
Connolly, ed. HTML Document dialects World Wide Web Consortium, 1996. (See http://www.w3.org/MarkUp/WD-doctypes.)
SOAP 1.2
W3C Recommendation, SOAP 1.2 Part 1: Messaging Framework (See http://www.w3.org/TR/SOAP/.)
WSDL 1.1
W3C Note, WSDL 1.1 (See http://www.w3.org/TR/WSDL/.)
XML 1.0
W3C Recommendation, XML 1.0 (See http://www.w3.org/TR/REC-xml.)
XInclude
W3C Working Draft, XML Inclusions (See http://www.w3.org/TR-Xinclude.)
XML Namespaces
W3C Recommendation, XML Namespaces (See http://www.w3.org/TR/REC-xml-names.)
XML Schema Part 2
W3C Recommendation, XML Schema, Part 2 (See http://www.w3.org/TR/xmlschema-2.)
XML Schema Wildcard Test Collection
XML Schema Wildcard Test collection (See http://www.w3.org/XML/2001/05/xmlschema-test-collection/result-ms-wildcards.htm.)
XFront Schema Best Practices
XFront Schema Best Practices (See http://www.xfront.com/BestPracticesHomepage.html.)
XML.com Schema Design Patterns
Dare Obasanjo. XML.com Schema design patterns. (See http://www.xml.com/pub/a/2002/07/03/schema_design.html.)
Dave Orchard writings on Extensibility and Versioning
Dave Orchard writings on extensibility and versioning (See http://www.pacificspirit.com/Authoring/Compatibility.)

8 Acknowledgements

The author thanks the many reviewers that have contributed to the article, particularly David Bau, William Cox, Ed Dumbill, Chris Ferris, Yaron Goland, Hal Lockhart, Mark Nottingham, Jeffrey Schlimmer, Cliff Schmidt, and Norman Walsh.