W3C

[Editorial Draft] Versioning for Arch Doc

Proposed Text for Arch Doc 03 Oct 2003

This version:
http://www.w3.org/2001/tag/doc/versioning-20031003
Previous version:
http://www.w3.org/2001/tag/doc/versioning
Editors:
David Orchard, BEA Systems, Inc. <David.Orchard@BEA.com>
Norman Walsh, Sun Microsystems, Inc. <Norman.Walsh@Sun.COM>

Abstract

This is the text that we propose for section 4.5 of the webarch doc.

Status of this Document

This document is a hacked shell for our proposed text.

Table of Contents

1 Extensibility and Versioning
    1.1 Terminology
    1.2 Why Extend languages?
    1.3 Identifying and Controlling Languages
    1.4 Understanding Extensions
    1.5 Versioning Languages
    1.6 Namespace content changes


1 Extensibility and Versioning

The primary motivation to allow instances of a language to be extended is to decentralize the task of designing, maintaining, and implementing extensions. It allows senders to change the instances without going through a centralized authority. To a great extent, the Web rose dramatically in popularity because decentralized extensions to HTML, HTTP and URLs were all possible. Each language provided explicit extensibility points and rules for understanding extensions that enabled the decentralized evolution of the languages. The support for the languages and their related rules is codified in a continuously evolving set of software agents. The allowance for agents to continuously evolve, without a "big bang" upgrade, dir

It is almost unheard of for a single version of a language to be deployed without requiring some kind of augmentation over time. Knowing that a language will not be all things to all people, a language designer can allow parties to extend instances of the language or the language itself. Typically the language designer will specify where extensions in the instance and extensions in the language are allowed.

As documents, or messages, are exchanged between agents, they are processed. Most agents are designed to discriminate between valid and invalid inputs. In order to have any sort of interoperability, a language must be defined or described in some normative way so that the terms "invalid" and "valid" have meaning.

There are a variety of tools that might be employed for this purpose (DTDs, W3C XML Schema, RELAX NG, Schematron, etc.). These tools might be augmented with normative prose documentation or even some agent-specific validation logic. In many cases, the schema language is the only validation logic that is available.

Whether you've deployed your agent on ten machines, or a hundred, or a million, if you change a language in such a way that all those applications will consider instances of the new language invalid, you've introduced a versioning problem with real costs.

Once a language is used outside of its development environment, there will be some cost associated with changing it: software, user expectations, and documentation may have to be updated to accommodate the change. Once a language is used in multiple environments, any changes made will introduce multiple versions of the language.

1.1 Terminology

Extensibility is a property that enables evolvability of software. It is perhaps the biggest contributor to loose coupling in systems as it enables the independent and potentially compatible evolution of languages. [Definition: A language is extensible if instances of the language can include terms from multiple vocabularies.] An extensible language is one with some syntax reserved for future use. To extend a language is to define the syntax for some of the reserved parts.

[Definition: A language is an identifiable set of vocabulary terms that has defined constraints.] Examples are the elements and attributes of XHTML 1.0 and the names of built-in functions in XPath 2.0. The syntactic structure of the language is constrained by the use of DTDs, XML Schema, other schema languages or narrative constraints expressed in the relevant language specification. By language, we just mean the set of elements and attributes, or components, used by a particular agent.

A language has one or more vocabularies. [Definition: A vocabulary is a set of terms]. In general, the intended meaning of a vocabulary term is scoped by the language in which the term is found. However, there is some expectation that terms drawn from an XML Namespace have a consistent meaning across all languages in which they are used.

An XML Namespace is a convenient container for collecting terms that are intended to be used together within a language or across languages. It provides a mechanism for creating globally unique names.

[Definition: An instance is a realization of a language]. Documents are instances of a language. They must have a root element in XML.

[Definition: Content is data that is part of an instance of a language.] Content has one or more components.

[Definition: A component is a realization of a term in a language.] XML elements and attributes are components. As a term has a name and the language has a namespace name, each component has a QName, that is, the combination of the namespace name and the name.
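As a minimal sketch of the component and QName definitions, the following Python fragment lists the QNames of the components in a small instance. The namespace name `http://example.org/names`, the element and attribute names, and the helper `qname` are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace name and instance, used only for illustration.
NS = "http://example.org/names"
doc = ET.fromstring(
    f'<n:name xmlns:n="{NS}" n:lang="en"><n:first>Pat</n:first></n:name>'
)

def qname(tag: str) -> tuple[str, str]:
    """Split ElementTree's '{namespace}local' form into (namespace name, name)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag

# Elements and attributes are both components, so both have QNames.
element_qnames = [qname(el.tag) for el in doc.iter()]
attribute_qnames = [qname(name) for name in doc.attrib]
```

Here each component is identified by the pair of namespace name and name, regardless of the prefix used in the instance.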

The interaction between agents and languages is described in terms of senders and receivers. [Definition: A sender is an agent that creates or produces an instance and sends it to another agent for processing.][Definition: A receiver is an agent that consumes an instance that it obtained from a sender.]

These terms and their relationships are shown below.

[Figure: UML diagram of language terms]

[Definition: A language change is backwards compatible if newer agents can process all instances of the old language. ] A software example is a word processor at version 5 being able to read and process version 4 documents. A schema example is a schema at version 5 being able to validate version 4 documents. In the case of Web services, this means that new Web services receivers, ones designed for the new version, will be able to process all instances of the old language. This means that a sender can send an old version of a message to a receiver that understands the new version and still have the message successfully processed.

[Definition: A language change is forwards compatible if older agents can process all instances of the newer language.] A software example is a word processor at version 4 being able to read and process version 5 documents. A schema example is a schema at version 4 being able to validate version 5 documents. In the case of Web services, this means that existing Web service receivers, designed for a previous version of the language, will be able to process all instances of the new language. This means that a sender can send a newer version of a message to an existing receiver and still have the message successfully processed.

In broad terms, backwards compatibility means that existing senders can continue to use newer services, and forwards compatibility means that newer senders can use existing services.
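The two definitions can be illustrated with a small sketch. It assumes a deliberately strict receiver that rejects any element name it does not know; the element names, messages, and the function `strict_receiver_accepts` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names known to each receiver version.
V1_KNOWN = {"name", "first", "last"}
V2_KNOWN = V1_KNOWN | {"middle"}

V1_MESSAGE = "<name><first>Pat</first><last>Smith</last></name>"
V2_MESSAGE = "<name><first>Pat</first><middle>Q</middle><last>Smith</last></name>"

def strict_receiver_accepts(known: set[str], message: str) -> bool:
    """A receiver that rejects any message containing a name it does not know."""
    return all(el.tag in known for el in ET.fromstring(message).iter())

# Backwards compatible: the newer (v2) receiver processes old (v1) instances.
backwards_ok = strict_receiver_accepts(V2_KNOWN, V1_MESSAGE)

# Not forwards compatible: the older (v1) receiver rejects new (v2) instances.
forwards_ok = strict_receiver_accepts(V1_KNOWN, V2_MESSAGE)
```

With a strict receiver, adding the `middle` element is a backwards compatible change but not a forwards compatible one. Making it forwards compatible requires the older receiver to tolerate names it does not recognize.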

The cost of changes that are not backward or forward compatible is often very high. All the software that uses the language must be updated to the newer version. The magnitude of that cost is directly related to whether the system in question is open or closed.

[Definition: A closed system is one in which all of the senders and receivers are more-or-less tightly connected and under the control of a single organization.] Closed systems can often provide integrity constraints across the entire system. A traditional database is a good example of a closed system: all of the database schemas are known at once, all of the tables are known to conform to the appropriate schema, and all of the elements in each row are known to be valid for the schema to which the table conforms.

From a versioning perspective, it might be practical in a closed system to say that a new version of a particular language is being introduced into the system at such and such a time and all of the data that conforms to the previous version of the schema will be migrated to the new schema.

[Definition: An open system is one in which some senders and receivers are loosely connected or are not controlled by the same organization.] The Internet is a good example of an open system.

1.3 Identifying and Controlling Languages

Some changes make a language completely incompatible with previous versions. Changes can also be backwards and forwards compatible. Designing languages to support compatible changes reduces the cost of those changes.

In an open system, it's simply not practical to handle language evolution with universal, simultaneous, atomic upgrades to all of the software components. Existing senders and receivers outside the immediate control of the organization that's publishing a changed language will continue to use the previous version for some (possibly long) period of time.

Finally, it's important to remember that systems evolve over time and have different requirements at different stages in their life cycle. During development, when the first version of a language is under active development, it may be valuable to pursue a much more aggressive, draconian versioning strategy. After a system is in production and there is an expectation of stability in the language, it may be necessary to proceed with more caution. Being prepared to move forward in a backwards and forwards compatible manner is the strongest argument for worrying about versioning at the very beginning of a project.

Controlling the evolution of a language relies on two assumptions:

  1. The agent must understand the semantics of every valid message that it receives. We must therefore define the semantics of messages that contain new elements or attributes.

  2. We assume that each service rejects invalid messages. Therefore, it must be possible for our language to evolve without changing the schema that we've defined for it. New versions of a service might be deployed with newer schemas, but we want these new services to be able to communicate with the already deployed senders and receivers that will continue to use the old schemas. That is why forwards compatible language changes have to be possible without changing the schema.

In order for a schema to be extensible in the way described above, to allow new elements or attributes to be added without changing the schema, the schema must allow extension in any namespace. This brings us to the next rule for enabling a must ignore versioning strategy in XML languages:

Good Practice

Any Namespace: Every language SHOULD provide for extension in any namespace.

It usually makes sense to allow extension in attributes as well.

Good Practice

Full Extensibility: All XML elements that can allow attributes, i.e., complex types in XML Schema, SHOULD allow any attributes and any elements in their content models.
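As a hedged illustration of the two practices above, the sketch below embeds a small W3C XML Schema whose complex type allows elements and attributes from any namespace through the `xs:any` and `xs:anyAttribute` wildcards, and then performs a rough check that every complex type declares both. The schema, its target namespace, and the function `non_extensible_types` are hypothetical. The check only inspects the schema document and is not a schema validator.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema for a "name" element that is open to extension.
SCHEMA = f"""
<xs:schema xmlns:xs="{XS}" targetNamespace="http://example.org/names"
           elementFormDefault="qualified">
  <xs:element name="name">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="first" type="xs:string"/>
        <xs:element name="last" type="xs:string"/>
        <xs:any namespace="##any" processContents="lax"
                minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:anyAttribute namespace="##any" processContents="lax"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

def non_extensible_types(schema_text: str) -> list[str]:
    """Return labels of complex types lacking an element or attribute wildcard.

    Rough check only: a wildcard found in a nested complex type also counts.
    """
    root = ET.fromstring(schema_text.strip())
    missing = []
    for index, ctype in enumerate(root.iter(f"{{{XS}}}complexType")):
        has_any = ctype.find(f".//{{{XS}}}any") is not None
        has_any_attribute = ctype.find(f".//{{{XS}}}anyAttribute") is not None
        if not (has_any and has_any_attribute):
            missing.append(ctype.get("name", f"anonymous#{index}"))
    return missing

closed_types = non_extensible_types(SCHEMA)
```

In this sketch the wildcard follows two required elements. Real schemas must also satisfy the determinism rules of the schema language, which can restrict where a wildcard may be placed.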

The corollary of extensibility in any namespace, including the language's namespace, is that a namespace does not identify a single version of a language or set of names. A namespace identifies a compatible set of names.

Good Practice

Namespace identifies compatible names: The namespace name identifies names that are compatible within the same namespace name.

Given that a namespace name is not for a single version of a language or set of names, it may be useful to identify the particular version. An example would be specifying in a policy statement the exact language supported by a software agent. Under this scheme, the version identifier distinguishes the compatible "minor" versions, while a change of namespace name distinguishes the incompatible versions.

Good Practice

Identify specific version with version attribute: The specific version of a set of names within a given namespace may be identified with a version attribute to differentiate between the compatible versions.
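A minimal sketch of this practice follows. It assumes, purely for illustration, that the version is carried in an unqualified `version` attribute on the root element; the namespace name and the function `identify` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional

NS = "http://example.org/names"  # hypothetical namespace name

def identify(message: str) -> tuple[str, Optional[str]]:
    """Return (namespace name, version attribute value) of the root element."""
    root = ET.fromstring(message)
    namespace = root.tag[1:].split("}", 1)[0] if root.tag.startswith("{") else ""
    return namespace, root.get("version")

# Two compatible versions share one namespace name and differ only in "version".
older = identify(f'<name xmlns="{NS}" version="1.0"/>')
newer = identify(f'<name xmlns="{NS}" version="1.1"/>')
same_language_family = older[0] == newer[0]
```

The namespace name tells the receiver whether the instance belongs to a compatible set of names; the version value is extra information, for example for logging or policy checks.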

1.4 Understanding Extensions

The key value of the extension strategy described above is that existing XML documents can be extended without having to change existing implementations. For languages that are intended to be extensible, specifications SHOULD provide a clear processing model for extensions.

Good Practice

Provide Processing Model: Languages MUST provide a processing model for dealing with extensions.

Given that an existing agent cannot possibly know the intended semantics of a component that it has never seen before, only one semantic is possible: ignore that component. We propose, therefore, that agents "must ignore" elements and attributes they do not recognize.

For many agents, including most Web services, the most practical rule is: must ignore.

Good Practice

Must Ignore: Receivers MUST ignore any XML attributes or elements that they do not recognize in a valid XML document.

This rule does not require that the elements be physically removed; only ignored for most processing purposes. It would be reasonable, for example, if a logging agent included unrecognized elements in its log. There are cases where the elements should not be physically removed. For example, an agent that forwards the content to another receiver should preserve the unknown content.
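The Must Ignore rule might be sketched as follows. The namespace names, the message, and the function `process` are hypothetical; the receiver recognizes only `first` and `last` and skips everything else without removing it from the parsed instance.

```python
import xml.etree.ElementTree as ET

NS = "http://example.org/names"  # hypothetical namespace name
KNOWN = {f"{{{NS}}}first", f"{{{NS}}}last"}  # names this receiver recognizes

MESSAGE = f"""<name xmlns="{NS}" xmlns:x="http://example.org/ext" x:source="crm">
  <first>Pat</first>
  <x:nickname>P</x:nickname>
  <last>Smith</last>
</name>"""

def process(message: str) -> dict[str, str]:
    """Extract recognized fields; unrecognized elements and attributes are ignored."""
    root = ET.fromstring(message)
    result = {}
    for child in root:
        if child.tag in KNOWN:
            local_name = child.tag.split("}", 1)[1]
            result[local_name] = child.text or ""
        # Unrecognized children are skipped, not deleted: the parsed tree
        # still holds them, so a forwarding agent could pass them on.
    return result

fields = process(MESSAGE)
```

The extension element `x:nickname` and the extension attribute `x:source` do not cause the message to be rejected and do not appear in the extracted fields.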

HTTP/1.1 is an example of a language that specifies that receivers should ignore any headers that they do not understand. RFC 2616 says "Unrecognized header fields SHOULD be ignored by the recipient and MUST be forwarded by transparent proxies."

There are two broad types of agents with respect to dealing with extensions: presentation (or document) oriented agents and data oriented agents. For data oriented agents, such as Web services, the rule is:

Good Practice

Must Ignore All: The Must Ignore rule applies to unrecognized elements and their descendants.

Agents must deal carefully with the ignored elements, especially if any of them are counted or if the agent makes use of information about their position.
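As a sketch of Must Ignore All, the fragment below builds the view of a message that a data oriented agent would work with by dropping each unrecognized element together with its descendants. The element names, the message, and the functions `prune_unknown` and `data_view` are hypothetical.

```python
import xml.etree.ElementTree as ET

KNOWN = {"order", "item", "quantity"}  # hypothetical names this agent recognizes

MESSAGE = (
    "<order>"
    "<item>pen</item>"
    "<audit><item>internal</item></audit>"
    "<quantity>2</quantity>"
    "</order>"
)

def prune_unknown(element: ET.Element) -> None:
    """Drop unrecognized children, including everything nested inside them."""
    for child in list(element):
        if child.tag in KNOWN:
            prune_unknown(child)
        else:
            element.remove(child)

def data_view(message: str) -> ET.Element:
    """Parse a message and return the tree as seen under Must Ignore All."""
    root = ET.fromstring(message)
    prune_unknown(root)
    return root

view = data_view(MESSAGE)
items = [el.text for el in view.iter("item")]
```

The `item` nested inside the unrecognized `audit` element is ignored even though its name is recognized, so a count of items yields one rather than two. This is the kind of counting and position effect that agents need to handle with care.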

Document oriented languages need a different rule as the agent will still want to present the content of an unknown element. The rule for document oriented agents is:

Good Practice

Must Ignore Container: The Must Ignore rule applies only to unrecognized elements.

This retains the descendants of the ignored container element, such as for display purposes.

An example of the Must Ignore Container rule is HTML 1.1, 2.0 and 3.2. They specify that any unknown start tags or end tags are mapped to nothing during tokenization.
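Must Ignore Container can be sketched with a toy renderer that produces plain text. The tag names, the marker table `KNOWN`, and the function `render` are hypothetical; an unrecognized tag contributes no markup of its own, but its text and children are still rendered.

```python
import xml.etree.ElementTree as ET

# Hypothetical renderer: each recognized tag maps to (prefix, suffix) markers.
KNOWN = {"p": ("", ""), "em": ("*", "*")}

def render(element: ET.Element) -> str:
    """Render an element as text, ignoring unknown containers but not their content."""
    prefix, suffix = KNOWN.get(element.tag, ("", ""))  # unknown tag: no markers
    parts = [prefix, element.text or ""]
    for child in element:
        parts.append(render(child))
        parts.append(child.tail or "")
    parts.append(suffix)
    return "".join(parts)

document = ET.fromstring("<p>Hello <blink>big <em>wide</em></blink> world</p>")
rendered = render(document)
```

The unrecognized `blink` container disappears from the output, while the text and the recognized `em` element inside it are kept for display.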

In order to accommodate big bang changes when they are needed, the must ignore rule is not expected to apply to the root element. If the document root is unrecognized, the entire message must be rejected.
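The root element exception might look like the following sketch, in which the expected root QName, the exception class `UnrecognizedRoot`, and the function `accept` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical QName of the only root element this receiver recognizes.
EXPECTED_ROOT = "{http://example.org/names}name"

class UnrecognizedRoot(ValueError):
    """Raised when the document root is not recognized."""

def accept(message: str) -> ET.Element:
    """Reject the whole message if its root is unrecognized; otherwise return it."""
    root = ET.fromstring(message)
    if root.tag != EXPECTED_ROOT:
        raise UnrecognizedRoot(root.tag)
    return root

accepted = accept('<name xmlns="http://example.org/names"><extra/></name>')
```

An unrecognized child such as `extra` is tolerated, whereas a root in a different namespace, for example one introduced by an incompatible revision, causes the message to be rejected outright.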

1.6 Namespace content changes

Only the owner of a namespace can change (i.e., version) the meaning of elements and attributes in that namespace.

Constraint

Only Namespace Owners Change Namespace: The namespace name owner is the only entity that is allowed to change the meaning of names in a namespace.

There is a school of thought that says that every extension should be placed in a separate namespace; that after publication, no new names should be added to a namespace. If you hold that point of view then you may not feel that an extensibility element is necessary or desirable.

Another school of thought says that the maintainers of the language have a right to add new names to a namespace as they see fit. There are certain advantages associated with adding new names in the same namespace.

  1. It reduces the number of namespaces needed to describe instances of the document. There are significant convenience advantages to using defaulted namespaces for document creation and manipulation.

  2. It provides a clear separation between extensions by the language designers and extensions by third parties.

  3. There may be additional benefits in code generation and reuse if a single namespace or a small set of namespaces can completely describe the language.

A namespace name owner will use the lifecycle of the namespace as one of the factors in determining whether to revise the namespace or not. Typically, the changes during development are not compatible changes. The author of namespaces that are under development will typically follow a "big bang" approach. This helps reduce the number of potentially buggy or immature implementations that are deployed. A W3C specification is a good illustrative example. It will probably change namespace names for each working draft. The Candidate Recommendation, Proposed Recommendation and Recommendation namespace names should only be changed if compatibility is not achieved.