An Approach for Evolving XML Vocabularies
Using XML Schema

Noah Mendelsohn
IBM Corporation
June 15, 2004

ROUGH DRAFT: DAVID HAS ASKED ME TO GET THIS OUT ASAP. I’LL PROOFREAD TOMORROW AND REPUBLISH IF NECESSARY. COMMENTS ARE WELCOME. THANK YOU.

Introduction

At the May 2004 face-to-face meeting of the XML Schema WG I discussed some ideas for managing evolving XML vocabularies using XML Schema. This note is an attempt to summarize that proposal. These ideas remain at best incomplete, and they do not necessarily represent the preferred approach of my employer, IBM. I do hope they are helpful in facilitating workgroup discussions of XML versioning.

Assumptions & Rationale

In this exposition, the term “vocabulary” is used broadly and somewhat informally to refer to a user-defined XML language that is used for some purpose or other. It is assumed that a given version of such a vocabulary will be describable by some XML schema that can validate element information items conforming to the vocabulary. Though informal, this definition should suffice to introduce the design. Indeed, some of the characteristics of a vocabulary (i.e., what can be validated, what can be versioned, etc.) become clearer from the design itself. Note that a vocabulary is not necessarily a single namespace, and a single namespace may embody all or part of more than one vocabulary. Changes to a vocabulary may or may not be made using the same namespace(s) as the original.

There seems to be wide divergence of opinion as to how vocabularies will evolve in various scenarios, who will be involved in making changes, how often the same vocabulary will be revised over time, what sorts of incompatibilities may be introduced, which users will have which versions of a schema, how documents that use multiple evolving namespaces are to be managed, and so on. Some of the specific proposals that have been discussed make particular assumptions about the answers to these questions. A thesis of this proposal is that XML Schema can relatively easily be adapted to support a quite broad range of answers to the above questions.

Specifically, this proposal is based on the following assumptions and design goals:

· The same vocabulary may be versioned or fixed repeatedly. Accordingly, the design should be convenient to use even after 20 or 30 such revisions.

· The versioning mechanisms should not presume particular instance constructions such as <extension> elements.

· In some but not all cases, forward and/or backward compatibility is required: i.e., it should be possible but not essential to write early schemas that will somehow accept content that is not fully defined until later, and schemas for later versions will often but not always validate earlier forms of the vocabulary.

· Conversely, breaking changes should not be forbidden. For example, it may be that an early construct is deprecated at some later time, and perhaps completely disallowed eventually. Likewise, later versions may introduce constructs that are rejected outright by earlier ones.

· It should be possible to check for or force various sorts of forward or backward compatibility when desired.

· Schemas for versions of a vocabulary may but need not form a sequence or tree, in which later versions somehow directly reference particular schema documents for earlier versions. This choice allows for redefinition of the same vocabulary by multiple organizations or in multiple schema files, as XML does today. In this design, you can rewrite the schema for my vocabulary to adapt it for your own use (this may or may not be good common practice, but the versioning mechanisms do not prohibit it).

· A consequence of the point above is that the schema for version x is not necessarily expressed as a delta to the schema for version x-1, if in fact the versions form a sequence at all. Such incremental definition schemes are convenient, but do not necessarily scale to the case where the same vocabulary is revised 20 or 30 times. In such a case one would need up to 30 schema documents to assemble the effective schema. Accordingly, this design allows for but does not require such incremental definition.

· As discussed above, no unnecessary assumptions should be made regarding the relationships between vocabularies and XML Namespaces. Often, a vocabulary will be expressed primarily as a single XML namespace. Often, to maintain forward and backward compatibility, that same namespace will be used in subsequent versions as well. Nothing in this design prohibits the use or coordinated evolution of multiple namespaces, the addition of new namespaces in subsequent versions of a vocabulary, etc.

Separation of Concerns and Goals of the Core Mechanism

Obviously, this design is targeted at a very flexible set of assumptions as to how XML is used. This is achieved through more careful attention to separation of concerns than seems to be the case in some other proposals. Indeed, the key feature of this approach is that its core mechanisms are not fundamentally aimed at defining “versions”, but rather at these two goals:

1) Making it convenient to write a schema that, when validating, distinguishes in the PSVI content that is truly expected from content that is tolerated but not fully understood (i.e., to tolerate later changes to the vocabulary), and distinguishes both from content that is to be completely rejected.

2) Given two such schemas, making it convenient to check whether there is any content that one will allow that the other will not tolerate. This is to facilitate the situations in which you do want such compatibility.

Note that the above refers to schemas, not schema documents, so all of the above applies to mixed-namespace scenarios, independently of how many schema documents may have been imported or included to make the schema. The means by which this is achieved are discussed below under “The Core Design”. We then allow for but do not require the invention and optional use of mechanisms that:

· Facilitate the writing of schema documents that embody incremental changes to earlier versions (e.g. mechanisms like <xsd:redefine>). Indeed, the design specifically allows one to make, say, 5 or 6 successive changes using such an incremental mechanism, and then for cleanliness rewrite the next version of the schema document from scratch.

· Declare an intensional tree or other graph of schema documents or schemas to facilitate management of evolving definitions. These would include mechanisms such as version="x.y" attributes, attributes asserting that one schema is an evolution of some other designated schema, etc.

The next section discusses the means by which the core mechanism achieves the two goals set out above.

The Core Design

The two main goals of the core design are listed above. These are achieved as follows:

Distinguishing expected from tolerated or disallowed content using wildcards

XML Schema provides a so-called “wildcard”, which is expressed in schema documents as <xsd:any>. The fundamental thesis of this design is that content truly expected by an application can usually be validated by a non-wildcard particle; wildcards can be used to designate places in the content model where additional content is tolerated to facilitate interoperation with other versions of the vocabulary.

The Unique Particle Attribution (UPA) constraint of XML Schema ensures that applications can tell from the PSVI which content has been validated by a wildcard and which by an element declaration. Consider the following two schemas:

Version A:

<xsd:sequence>
  <xsd:element name="x" type="xsd:integer"/>
  <xsd:any minOccurs="0" maxOccurs="unbounded" processContents="skip"/>
</xsd:sequence>

Given the instance:

<x>123</x> <y>abc</y>

this schema will create a PSVI associating <x> with the element declaration and <y> with the wildcard. Now let’s assume that the reason this instance showed up was that it was created by an application that knew about schema version B:

Version B:

<xsd:sequence>
  <xsd:element name="x" type="xsd:integer"/>
  <xsd:element name="y" type="xsd:string"/>
  <xsd:any minOccurs="0" maxOccurs="unbounded" processContents="skip"/>
</xsd:sequence>

An application validating with version B accepts the same instance but associates <y> with the second element declaration in the PSVI. This application presumably has first-class knowledge of both <x> and <y>.
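The attribution just described can be sketched in a few lines of Python. This is only an illustration, not a schema processor: the Particle class and the attribute function are hypothetical names, the model is limited to a flat sequence of element names, and the greedy left-to-right walk is adequate only for deterministic models such as Versions A and B.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Particle:
    name: Optional[str]            # element name, or None for an <xsd:any> wildcard
    min_occurs: int = 1
    max_occurs: Optional[int] = 1  # None stands for "unbounded"


def attribute(model: List[Particle],
              children: List[str]) -> Optional[List[Tuple[str, str]]]:
    """Walk a flat sequence left to right.

    Returns (child, 'declaration' | 'wildcard') pairs, or None if the
    children are not valid against the model.
    """
    result = []
    i = 0
    for p in model:
        count = 0
        while i < len(children) and (p.max_occurs is None or count < p.max_occurs):
            if p.name is not None and children[i] != p.name:
                break  # this declaration does not match the next child
            kind = "declaration" if p.name is not None else "wildcard"
            result.append((children[i], kind))
            i += 1
            count += 1
        if count < p.min_occurs:
            return None  # a required particle was not satisfied
    return result if i == len(children) else None


version_a = [Particle("x"), Particle(None, 0, None)]
version_b = [Particle("x"), Particle("y"), Particle(None, 0, None)]
instance = ["x", "y"]  # stands for <x>123</x> <y>abc</y>

print(attribute(version_a, instance))  # [('x', 'declaration'), ('y', 'wildcard')]
print(attribute(version_b, instance))  # [('x', 'declaration'), ('y', 'declaration')]
```

The same instance is accepted in both cases; only the attribution recorded for <y> differs, which is the information an application would read from the PSVI.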

Weak Wildcards avoid UPA conflicts

Use of wildcards in roughly this manner has been considered on and off for years, but has until now been inhibited by UPA, which prohibits schemas such as the following:

<xsd:sequence>
  <xsd:element name="x" type="xsd:integer" minOccurs="0"/>
  <xsd:any minOccurs="0" maxOccurs="unbounded" processContents="skip"/>
</xsd:sequence>

The above causes a UPA violation, because a single element <x> matches either the first or the second particle.

The schema workgroup has recently given serious consideration, for other reasons, to changing wildcards to behave in a so-called weak manner. This would not change the behavior of existing schemas, but would allow schemas such as the one shown above. In such cases, the explicit element declaration would always take precedence, thus removing the ambiguity. This proposal presumes the existence of weak wildcards in the schema design. Indeed, this is the only incompatible change that is absolutely required in comparison to XML Schema 1.0.

Given that we have weak wildcards for other reasons, it becomes possible to use wildcards freely at any point in a content model. Thus, this proposal allows at user discretion for extension of vocabularies not just at the end of each model, but anywhere that a weak wildcard can be used.
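One possible reading of "the explicit element declaration would always take precedence" is sketched below in Python. The function names and the tuple encoding (name, minOccurs, maxOccurs), with None meaning a wildcard or "unbounded", are assumptions made for this illustration; the precedence rule coded here (prefer a declaration at the first child where two attributions differ) is my interpretation, not text from any schema draft.

```python
def attributions(model, children):
    """Yield every way to assign children to the particles of a flat sequence.

    Each result is a list of 'declaration' / 'wildcard' labels, one per child.
    """
    def walk(pi, ci, count):
        if pi == len(model):
            if ci == len(children):
                yield []
            return
        name, lo, hi = model[pi]
        # Option 1: the current particle consumes the next child.
        if (ci < len(children)
                and (hi is None or count < hi)
                and name in (None, children[ci])):
            kind = "wildcard" if name is None else "declaration"
            for rest in walk(pi, ci + 1, count + 1):
                yield [kind] + rest
        # Option 2: move on to the next particle if minOccurs is satisfied.
        if count >= lo:
            yield from walk(pi + 1, ci, 0)

    yield from walk(0, 0, 0)


def resolve_weak(model, children):
    """Pick the attribution in which declarations win over wildcards."""
    candidates = list(attributions(model, children))
    # False sorts before True, so a declaration is preferred at the
    # first position where two candidate attributions disagree.
    return min(candidates,
               key=lambda labels: [k == "wildcard" for k in labels],
               default=None)


# The optional <x> followed by an unbounded wildcard, as in the schema above.
model = [("x", 0, 1), (None, 0, None)]

print(list(attributions(model, ["x"])))  # [['declaration'], ['wildcard']]
print(resolve_weak(model, ["x"]))        # ['declaration']
```

The first line of output shows the ambiguity that UPA forbids: a lone <x> can be attributed in two ways. The second shows the weak-wildcard outcome, in which the declaration is chosen.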

Possible further changes to facilitate wildcard use

The above analysis shows how weak wildcards can be used to facilitate construction of extensible content models. In fact there are at least two sorts of further changes to XML Schema that we would want to consider in conjunction with this proposal:

· Our current wildcards allow content from any namespace, other namespaces, or a list of designated namespaces. These may not necessarily be the most useful options for our versioning scenarios. One proposal is to introduce a wildcard that would validate any element not explicitly declared elsewhere in the schema (regardless of namespace, or perhaps intersected with the existing namespace controls). This supports an idiom in which: if I know about an element and I don’t explicitly call for it, that means I don’t want it. I personally think we would want to use something like this in the schema for schemas.

· Purely as a convenience, we could introduce wildcard defaulting mechanisms into the transfer syntax. So, we might have something like:

<xsd:schema … defaultExtensionModel="{openAtEnd, openEverywhere}">

where “openAtEnd” causes a weak wildcard to appear by default at the end of every content model, and “openEverywhere” puts one between each pair of particles. This is an admittedly vague proposal, but it’s just a convenience and we can decide later what if anything we want along these lines.
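To make the defaulting idea concrete, here is a rough Python sketch of how a tool might expand such a default into explicit weak wildcards. Everything in it is hypothetical: the function name, the tuple encoding of particles, and in particular the assumption that openEverywhere also keeps the trailing wildcard, which the proposal does not specify.

```python
# A weak wildcard with minOccurs="0" and maxOccurs="unbounded" (None = unbounded).
WEAK_ANY = ("any", 0, None)


def apply_default_extension(particles, mode):
    """Return a copy of a flat content model with default wildcards added."""
    if mode == "openAtEnd":
        # One wildcard after the last particle.
        return list(particles) + [WEAK_ANY]
    if mode == "openEverywhere":
        # One wildcard after every particle, which places a wildcard between
        # each pair of particles and (by assumption) one at the end as well.
        opened = []
        for particle in particles:
            opened += [particle, WEAK_ANY]
        return opened
    raise ValueError(f"unknown extension model: {mode!r}")


model = [("x", 1, 1), ("y", 1, 1)]
print(apply_default_extension(model, "openAtEnd"))
print(apply_default_extension(model, "openEverywhere"))
```

Because the expansion is purely mechanical, it could live in the transfer syntax or in an authoring tool without affecting the component model.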

Comparing two schemas to determine subsumption

The sections above outline an approach that uses weak wildcards to facilitate the construction of schemas that distinguish among tolerated, expected, and disallowed content. The second goal is to facilitate checking whether one schema will fail to tolerate any of the content allowed by another.

The schema workgroup has recently devoted significant effort to proving that the subsumption relation between any two content models can be tractably checked. That work was done to support a simplification of the rules for complexType refinement.

This design proposes to use the same subsumption algorithm to achieve our second goal. Specifically, given an element declaration from schema 1 and a declaration for a similarly named element from schema 2, the subsumption algorithm allows one to determine whether any content accepted by one of the schemas will be rejected by the other.
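The kind of question such a check answers can be illustrated with a small Python sketch. Note that this is not the tractable subsumption algorithm studied by the workgroup: it is a brute-force enumeration over short child sequences drawn from a small alphabet, and the function names and tuple encoding (name, minOccurs, maxOccurs), with None meaning a wildcard or "unbounded", are assumptions made for the example.

```python
from itertools import product


def accepts(model, children):
    """True if a flat sequence model accepts the given child element names."""
    def walk(pi, ci, count):
        if pi == len(model):
            return ci == len(children)
        name, lo, hi = model[pi]
        # Try letting the current particle consume the next child.
        if (ci < len(children)
                and (hi is None or count < hi)
                and name in (None, children[ci])):
            if walk(pi, ci + 1, count + 1):
                return True
        # Otherwise move to the next particle if minOccurs is satisfied.
        return count >= lo and walk(pi + 1, ci, 0)

    return walk(0, 0, 0)


def not_tolerated(allowing, tolerating, alphabet, max_len):
    """List child sequences that `allowing` accepts but `tolerating` rejects."""
    return [seq
            for n in range(max_len + 1)
            for seq in product(alphabet, repeat=n)
            if accepts(allowing, seq) and not accepts(tolerating, seq)]


version_a = [("x", 1, 1), (None, 0, None)]
version_b = [("x", 1, 1), ("y", 1, 1), (None, 0, None)]
names = ["x", "y", "z"]

print(not_tolerated(version_b, version_a, names, 2))  # []
print(not_tolerated(version_a, version_b, names, 2))  # [('x',), ('x', 'x'), ('x', 'z')]
```

Within the bound, Version A tolerates everything Version B allows. The reverse check lists instances, such as a lone <x>, that Version B rejects because it requires <y>; a development tool could report exactly this kind of list.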

This design does not require that such checking be done. The assumption is that in many scenarios application developers will wish to enforce such discipline between evolving versions, and we show how development tools can do the necessary checking as schemas are developed and before they are deployed. Conversely, if a user wishes for whatever reason to introduce an intentional incompatibility, then a suitably written checker can be used to ensure that only the intended incompatibilities have been created.

Note also that nothing in the design as presented so far states that versions must evolve in linear or tree-like form. Indeed, completely independent organizations can create schemas that purport to validate similar or identical vocabularies, can check the degree to which the goal is achieved, and can deploy as appropriate. Debug versions can be checked against production versions, and so on.

Optional Features

As implied above, optional layers can be defined to meet additional goals such as the following:

· Facilitating the creation of one schema document based on another, particularly if the changes are small. We should see whether or not redefine meets the whole need.

· Facilitating the automatic insertion of weak wildcards to create content models that are by default open.

· Supporting some sort of standard labeling for versions (e.g. version="x.y"). Note that the mechanisms above are oblivious to such labeling, but various schema document management and deployment systems may find it useful.

While not a separate layer, we should also consider the proposal above to:

· Provide new options on wildcards to accept only content likely to be used in evolution of a given vocabulary (e.g. from the current namespace but not explicitly declared in the current schema).
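As a rough illustration of that last option, the matching rule for such a wildcard might look like the following Python sketch. The function name, the (namespace, local name) representation, and the example namespace URIs are all hypothetical; the rule shown is just the parenthetical example above, not a worked-out design.

```python
from typing import Set, Tuple

QName = Tuple[str, str]  # (namespace URI, local name)


def evolution_wildcard_matches(qname: QName,
                               target_ns: str,
                               declared: Set[QName]) -> bool:
    """Match elements from the current namespace that the schema does not declare."""
    namespace, _local = qname
    return namespace == target_ns and qname not in declared


NS = "http://example.org/vocab"  # hypothetical target namespace
declared = {(NS, "x"), (NS, "y")}

print(evolution_wildcard_matches((NS, "z"), NS, declared))  # True
print(evolution_wildcard_matches((NS, "x"), NS, declared))  # False
print(evolution_wildcard_matches(("http://example.org/other", "z"), NS, declared))  # False
```

An element the schema already knows about is never swallowed by the wildcard, which matches the idiom "if I know about an element and I don't explicitly call for it, I don't want it".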

Conclusion

Careful readers will note that the only truly required incompatibility with XML Schema 1.0 is the introduction of weak wildcards, a change that is already contemplated. Other changes to facilitate version management or incremental development of schemas may also prove desirable, but are not strictly required. Accordingly, an interesting feature of this design is that it seems to achieve a broader range of goals than many others, using a single design change that has already been contemplated for other reasons. Indeed, that one change does not invalidate any existing schemas, but merely allows the use of wildcards in situations where they were previously prohibited.

Among the next steps I suggest are:

· Check this proposal against a broad range of use cases and with our user community.

· Check with the schema implementation community to gauge their willingness to deploy weak wildcards and to deal with the attendant incompatibility with XML Schema 1.0.

· Consider the optional features above, whether they are worth implementing, and if so whether implementers of the schema language will support them.