W3C

[Editorial Draft] Extending and Versioning Languages Part 1

Draft TAG Finding 26 March 2007

This version:

http://www.w3.org/2001/tag/doc/versioning-20070326.html ( xml )

Latest version:

http://www.w3.org/2001/tag/doc/versioning

Previous versions:

Unapproved Editors Drafts: http://www.w3.org/2001/tag/doc/versioning-20060726.html, http://www.w3.org/2001/tag/doc/versioning-20060717.html, http://www.w3.org/2001/tag/doc/versioning-20060710.html, http://www.w3.org/2001/tag/doc/versioning-20031116.html, http://www.w3.org/2001/tag/doc/versioning-20031003.html

Editor:

David Orchard, BEA Systems, Inc. <David.Orchard@BEA.com>


Abstract

This document provides terminology for discussing language versioning, a number of questions that language designers must answer, and a variety of version identification strategies. A separate document contains XML language specific discussion.

Status of this Document

This document has been developed for discussion by the W3C Technical Architecture Group. It does not yet represent the consensus opinion of the TAG.

Publication of this finding does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time.

Additional TAG findings, both approved and in draft state, may also be available. The TAG expects to incorporate this and other findings into a Web Architecture Document that will be published according to the process of the W3C Recommendation Track.

Please send comments on this finding to the publicly archived TAG mailing list www-tag@w3.org (archive).

Table of Contents

1 Introduction
    1.1 Terminology
        1.1.1 Compatibility
            1.1.1.1 Composition
        1.1.2 Partial Understanding
        1.1.3 Divergent Understanding and Compatibility
        1.1.4 Open or Closed systems
        1.1.5 Compatibility of languages vs compatibility of applications
    1.2 Why Worry About Extensibility and Versioning?
    1.3 Why Do Languages Change?
    1.4 How Do Languages Change?
        1.4.1 Why Extend languages?
    1.5 Kinds of Languages
2 Versioning Strategies
    2.1 Versioning Designs
        2.1.1 Big Bang/Incompatible
        2.1.2 Forwards Compatible
            2.1.2.1 Must Ignore Unknowns
            2.1.2.2 Fallback Provided
            2.1.2.3 Supporting functionality
        2.1.3 Backwards compatible
            2.1.3.1 Replacement
            2.1.3.2 Side-by-side
        2.1.4 Mixtures
    2.2 Why Have a Strategy?
3 Language Requirements
    3.1 What language form
    3.2 Can 3rd parties extend the language?
    3.3 Can 3rd parties extend the language in a compatible way?
    3.4 Can 3rd parties extend the language in an incompatible way?
    3.5 Can the designer extend the language in a compatible way?
    3.6 Can the designer extend the language in an incompatible way?
    3.7 Is the vocabulary a stand-alone language or an extension of another vocabulary?
    3.8 What Schema language(s)?
    3.9 Should extensions or versions be expressible in the Schema language?
    3.10 Requirements Summary
4 Language Design
    4.1 Schema language design choices or constraints.
    4.2 Substitution Mechanism.
    4.3 Component identification
    4.4 Identification of incompatible extensions
    4.5 Design Summary
5 Identifying and Extending Languages
    5.1 Version Numbers
    5.2 XML Namespaces
6 Case Studies
    6.1 HTML
    6.2 XML
    6.3 CSS
    6.4 Microformats
7 Extension versus Versioning
8 Conclusion
9 References
10 Acknowledgements


1 Introduction

The evolution of languages by adding, deleting, or changing syntax or semantics is called versioning. Making versioning work in practice is one of the most difficult problems in computing. Arguably, the Web rose dramatically in popularity because evolution and versioning were built into HTML and HTTP. Both systems provide explicit extensibility points and rules for understanding extensions that enable their decentralized extension and versioning.

This finding describes general problems of, and techniques for, evolving systems in compatible and incompatible ways. These techniques are designed to allow compatible changes with or without schema propagation. A number of questions, design patterns and rules are discussed with a focus on enabling versioning in XML vocabularies, making use of XML Namespaces and XML Schema constructs. This includes not only general rules, but also rules for working with languages that provide an extensible container model, notably SOAP.

1.1 Terminology

The terminology for describing languages, producers, consumers, information, constraints, syntax, evolvability and so on follows. Let us consider an example. Two or more systems need to exchange name information. Names may not be the perfect choice of example because of internationalization issues, but the example resonates with a very large audience. The Name Language is created to be exchanged. [Definition: A producer is an agent that creates text.] Continuing our example, Fred is a producer of Name Language text. [Definition: An Act of Production is the creation of text.] A producer produces text with the intent of conveying information. When Fred creates some text, that is an act of production. [Definition: A consumer is an agent that consumes text.] We will use Barney and Wilma as consumers of text. [Definition: An Act of Consumption is the processing of text of a language.] Wilma and Barney consume the text separately from each other, each of these being a separate act of consumption. A consumer is impacted by the instance that it consumes. That is, it interprets that instance and bases future processing, in part, on the information that it believes was present in that instance. Text can be consumed many times, by many consumers, and have many different impacts.

[Definition: A Language consists of a set of texts, any syntactic constraints on the text, a set of information, any semantic constraints on the information, and the mapping between texts and information.] [Definition: Text is a specific, discrete sequence of characters.] Given that there are constraints on a language, any particular text may or may not have membership in a language. Indeed, a particular string of characters may be a member of many languages, and there may be many different strings of characters that are members of a given language. The texts of the language are the units of exchange. Documents are texts of a language. The Name Language consists of a set of texts that use 3 terms, and it specifies a syntactic constraint: a name consists of a given and a family. [Definition: A language has a set of constraints that apply to the set of strings in the language.] These constraints can be defined in machine-processable syntactic constraint languages such as XML Schema, in microformats, or in human-readable textual descriptions such as HTML descriptions, or they can be embodied in software. Languages may or may not be defined by a schema in any particular schema language. The constraints on a language determine the strings that qualify for membership in the language. Vocabulary terms contribute to the set of strings, but they are not the only source of characters in the set of strings of a given language. The language strings may include characters outside of terms, such as punctuation. One reason for additional characters, such as whitespace and markup, is to distinguish or separate terms.

Example 1: Name examples.

<name>
  <given>Dave</given>
  <family>Orchard</family>
</name>

name="Dave Orchard"

<span class="fn">Dave Orchard</span>

urn:nameschem:given:Dave:family:Orchard

The set of information in a language almost always has semantics. In the Name Language, given and family have the semantics of given and family names of people. The language also has the binding from the items in the information set to the text set. Any potential act of interpretation, that is any consumption or production, conveys information from text according to the language's binding. The language is designed for acts of interpretation, that being the purpose of languages. In our example, this mapping is obvious and trivial, but in many languages it is not. Two languages may have the exact same strings but different meanings for them. In general, the intended meaning of a vocabulary term is scoped by the language in which the term is found. However, there is some expectation that terms drawn from a given vocabulary will have a consistent meaning across all languages in which they are used. Confusion often arises when terms have inconsistent meaning across languages. The Name terms might be used in other languages, but it is generally expected that they will still be "the same" in some meaningful sense.

These terms and their relationships are shown below:

Diagram of language terms

We say that Fred engages in an Act of Production that results in a Name Instance with respect to Name Language V1. The Name Instance is in the set of Name V1 Texts, that is the set of strings in the Name Language V1. The production of the Name Instance has the intent of conveying Information, which we call Information 1. This is shown below:

Production instance

We say that Barney engages in an Act of Consumption of a Name Instance with respect to Name Language V1. The consumption of the Name Instance has the impact of conveying Information 1. This is shown below:

Production and consumption instance

Versioning is an issue that affects almost all applications eventually. Whether it is a processor styling documents in batches to produce PDF files, Web services engaged in financial transactions, or HTML browsers, the language and instances will likely change over time. The versioning policies for a language, particularly whether the language is mutable or immutable, should be specified by the language owner. Versioning is closely related to extensibility, as extensible languages may allow different versions of instances than those known by the language designer. Applications may receive versions of a language that they aren't expecting.

If a Name Language V2 exists, with its set of strings and Information set, Wilma may consume the same Name Instance but with respect to the Name Language V2 and have the impact of Information 2. Name Language V2 relates to V1 by relationship r2, which is forwards compatible comparing language V1 to V2 instances, and backwards compatible comparing language V2 to V1 instances. Similarly, Information 2 - as conveyed by Consumption 2 - relates to Information 1 - as conveyed by Consumption 1 - by relationship r1.

Production and 2 Consumptions Instance

Extensibility is a property that enables evolvability of software. It is perhaps the biggest contributor to loose coupling in systems as it enables the independent and potentially compatible evolution of languages. [Definition: A language is Extensible if the syntax of the language allows information that is not defined in the current version of the language.] The Name Language is extensible if it can include terms that aren't defined in the language, like a new middle term.

1.1.1 Compatibility

As languages evolve, it is possible to speak of backwards and forwards compatibility. A language change is backwards compatible if newer processors can process all instances of the old language. Backwards compatibility means that a newer version of a consumer can be rolled out in a way that does not break existing producers. A producer can send an older version of a message to a consumer that understands the new version and still have the message successfully processed. A software example is a word processor at version 5 being able to read and process version 4 documents. A schema example is a schema at version 5 being able to validate version 4 documents. In the case of Web services, this means that new Web services consumers, ones designed for the new version, will be able to process all instances of the old language.

A language change is forwards compatible if older processors can process all instances of the newer language. Forwards compatibility means that a newer version of a producer can be deployed in a way that does not break existing consumers. Of course the older consumer will not implement any new behavior, but a producer can send a newer version of an instance to an existing consumer and still have the instance successfully processed. A software example is a word processor at version 4 being able to read and process version 5 documents. A schema example is a schema at version 4 being able to validate version 5 documents. In the case of Web services, this means that existing Web service consumers, designed for a previous version of the language, will be able to process all instances of the new language.

In general, backwards compatibility means that existing texts can be used by updated consumers, and forwards compatibility means that newer texts can be used by existing consumers. Another way of thinking of this is in terms of message exchanges. Backwards compatibility is where the consumer is updated and forwards compatibility is where the producer is updated, as shown below:

Example 2: Evolution of Producers and/or Consumers

Versioning Graphic

With respect to consumers and producers, backwards compatibility means that newer consumers can continue to use existing producers, and forwards compatibility means that existing consumers can be used by newer producers.

We need to be more precise about which parts of a language are compatible with which other parts. Every language has a Defined Text set, which contains only the texts explicitly defined by the language constraints. Typically, a language will define a mapping from each of the definitions to information. Each language also has an Accept Text set, which contains the texts that are allowed by the language constraints. Typically, the Accept Text set contains texts that are not in the Defined Text set and do not have a mapping to information. For example, consider a language whose syntax says that a name consists of a given followed by a family followed by anything. A text that consists of a name with only a given and a family falls in both the Defined and Accept Text sets. A text that consists of a name with a given, a family and an extension such as a middle falls in the Accept Text set but not the Defined Text set. By definition, the Accept Text set is a superset of the Defined Text set.
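As a rough illustration of the two sets, the following Python sketch models the "given followed by family followed by anything" syntax with two regular expressions. The patterns, the function names and the middle value are assumptions made for this example only; they are a simplification and not a definition of the Name Language.

```python
import re

# Hypothetical, simplified patterns for the Name Language.
# DEFINED: exactly a given followed by a family.
# ACCEPT: a given, a family, then anything at all before the end tag.
DEFINED = re.compile(r"<name><given>[^<]*</given><family>[^<]*</family></name>")
ACCEPT = re.compile(
    r"<name><given>[^<]*</given><family>[^<]*</family>.*</name>", re.DOTALL
)


def in_defined_set(text: str) -> bool:
    """True if the text is explicitly defined by the language constraints."""
    return DEFINED.fullmatch(text) is not None


def in_accept_set(text: str) -> bool:
    """True if the text is allowed by the language constraints."""
    return ACCEPT.fullmatch(text) is not None


plain = "<name><given>Dave</given><family>Orchard</family></name>"
# The middle element and its value are illustrative extension content.
extended = (
    "<name><given>Dave</given><family>Orchard</family>"
    "<middle>B</middle></name>"
)
```

Under these assumed patterns, `plain` is in both sets, while `extended` is in the Accept Text set only, which mirrors the superset relationship described above.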

We have discussed backwards and forwards compatibility in general, but there are other flavours of compatibility, based upon compatibility between the Accept Text set, Defined Text set and Information conveyed. Syntactic compatibility is compatibility with respect to the Texts only, not the information conveyed. Because languages have Accept and Defined Text sets, some producers will adhere to the Defined Text set, and others may generate extensions that fall in the Accept Text set. Compatibility with Producers that produce only texts in the Defined Text set is called "strict" compatibility. Compatibility with Producers that may produce Texts in the Accept Text Set that are not in the Defined Text Set is called "full" compatibility.

A more precise definition of compatibility is with respect to the texts, that is whether all the texts in one language are also texts in another language. Another precise form of compatibility is with respect to the information conveyed, that is whether the information conveyed by a text in one language is conveyed by the same text interpreted in another language. The texts could be compatible while the information conveyed is not. For example, the same text could mean different and incompatible things in the different languages. Most systems have different layers of software, each of which can view a text differently and affect compatibility. For example, the XML Schema PSVI view is different from the actual text. We can also differentiate between language compatibility and application compatibility. While it is often the case that they are directly related, sometimes they are not, that is two languages may be compatible but an application might be incompatible with one of them.

We provide mathematical definitions of a text's compatibility based upon our terminology.

·         Let L1 and L2 be Languages, where L2 is introduced "after" L1.

·         Let T be a text.

·         T is in L1 iff (T is valid per L1 | T is in L1's set of Texts).

·         Let I1 be the information conveyed by Text T per language L1.

·         Let I2 be the information conveyed by Text T per language L2.

·         Text T is "fully compatible" with language L2 if and only if I1 is compatible with I2 and (T is valid per L2 | T is in L2's set of Texts).

·         Text T is incompatible if any of the information in I2 is wrong (i.e. replaces a value in I1 with a different one) | (T is invalid per L2 | T is not in L2's set of Texts).

We can also provide mathematical definitions of language compatibility:

·         L2 is "fully backwards compatible" with L1 if every text in L1 Accept Text set is fully compatible with L2.

·         L2 is "strictly backwards compatible" with L1 if every text in L1 Defined Text set is fully compatible with L2.

·         L2 is "strictly backwards incompatible" with L1 if any text in L1 Defined Text set is incompatible with L2.

·         L1 is "fully forwards compatible" with L2 if every text in L2 Accept Text set is fully compatible with L1.

·         L1 is "strictly forwards compatible" with L2 if every text in L2 Defined Text set is fully compatible with L1.

·         L1 is "forwards incompatible" with L2 if any text in L2 Defined Text set is incompatible with L1.

·         And combined together: L1 is strictly compatible with L2 if every text in L2 Defined Text set is fully compatible with L1 AND every text in L1 Defined Text set is fully compatible with L2.
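The definitions above can be sketched in Python over small finite sets of texts. Everything in this sketch is an assumption made for illustration: the `Language` class, the "term=value" text format, the three toy languages V1, V2 and V3, and in particular the reading of "I1 is compatible with I2" as "no term shared by both has a different value".

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet

Info = Dict[str, str]


def make_interpreter(known_terms):
    """Build a text -> information mapping that understands only known_terms."""
    def interpret(text: str) -> Info:
        pairs = (item.split("=", 1) for item in text.split(";"))
        return {term: value for term, value in pairs if term in known_terms}
    return interpret


@dataclass(frozen=True)
class Language:
    defined: FrozenSet[str]           # Defined Text set
    accept: FrozenSet[str]            # Accept Text set (superset of defined)
    interpret: Callable[[str], Info]  # mapping from text to information


def info_compatible(i1: Info, i2: Info) -> bool:
    # Assumption: compatible means no shared term changes its value.
    return all(i2[term] == value for term, value in i1.items() if term in i2)


def text_fully_compatible(text: str, source: Language, target: Language) -> bool:
    """Text from `source` is in target's texts and conveys compatible information."""
    return text in target.accept and info_compatible(
        source.interpret(text), target.interpret(text)
    )


def strictly_backwards_compatible(l2: Language, l1: Language) -> bool:
    return all(text_fully_compatible(t, l1, l2) for t in l1.defined)


def fully_backwards_compatible(l2: Language, l1: Language) -> bool:
    return all(text_fully_compatible(t, l1, l2) for t in l1.accept)


def strictly_forwards_compatible(l1: Language, l2: Language) -> bool:
    return all(text_fully_compatible(t, l2, l1) for t in l2.defined)


def fully_forwards_compatible(l1: Language, l2: Language) -> bool:
    return all(text_fully_compatible(t, l2, l1) for t in l2.accept)


# Toy texts in a hypothetical "term=value" syntax.
T1 = "given=Dave;family=Orchard"
T2 = "given=Dave;family=Orchard;middle=B"

V1 = Language(frozenset({T1}), frozenset({T1, T2}),
              make_interpreter({"given", "family"}))
V2 = Language(frozenset({T1, T2}), frozenset({T1, T2}),
              make_interpreter({"given", "family", "middle"}))
# V3 makes middle mandatory, so it no longer accepts T1.
V3 = Language(frozenset({T2}), frozenset({T2}),
              make_interpreter({"given", "family", "middle"}))
```

In this toy model V2 is fully backwards compatible with V1 and V1 is fully forwards compatible with V2, because V1 already accepted the text that V2 later defines. V3 is not strictly backwards compatible with V1, since a text in the V1 Defined Text set is no longer among its texts.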

We can draw a few conclusions. Given L2 is strictly backwards incompatible with L1 if any text in L1 Defined Text set is incompatible with L2, the only way that L2 can be backwards compatible with L1 is if the L2 Defined Text Set is a superset of L1 Defined Text set. Roughly, that means the addition of optional items in L2. Given L1 is "fully forwards compatible" with L2 if every text in L2 Accept Text set is fully compatible with L1, the only way that L1 can be forwards compatible with L2 is if the L1 Accept Text set is a superset of the L2 Accept Text set. Roughly, that means L1 allows all of L2 and more. It is this superset relationship that is a key to forwards compatibility: L1 allows texts that will become defined in L2.

Compatibility can be restated in terms of superset/subset relationships.

·         Language L2 is strictly backwards compatible with Language L1 if L2 Defined Text set > (superset) L1 Defined Text Set AND every text in L1 Defined Text set is compatible with L2.

·         Language L1 is strictly forwards compatible with Language L2 if L1 Defined Text set > (superset) Language L2 Defined Text set AND every text in L2 Defined Text set is compatible with L1.

·         Language L2 is fully strictly compatible with Language L1 if L1 Accept Text set > (superset) Language L2 Accept Text set > (superset) L2 Defined Text set > (superset) L1 Defined Text Set AND every text in L1 Defined Text set is compatible with L2 AND every text in L2 Accept Text set is compatible with L1.

We have shown that forwards and backwards compatibility is only achievable through extensibility, and that compatible versioning is a process of gradually increasing the Defined Text Set, reducing the Accept Text Set and ensuring the information conveyed is compatible. If ever the set relationships defined earlier do not hold, then the versions are not compatible.

1.1.1.1 Composition

Many languages are compound languages consisting of multiple languages. For example, a purchase order language could use the name language for names. The forwards, backwards and full compatibility definitions account for composition of languages because the Defined and Accept sets of the used languages are incorporated into the language. For example, the purchase order language Accept Set is the Accept Set of all the items defined OR used by the Purchase Order language, which includes the Accept Set of the name language.

1.1.2 Partial Understanding

We have defined compatibility for all possible expressions of the language, that is full compatibility. There are many scenarios where a consumer will consume only part of the information set. Partial understanding affects the Text set and the Information conveyed. Partial understanding usually results in a subset of the information (because only part of the information is understood). Interestingly, partial understanding results in an increase in size, or a superset, of the Accept Text Set and a parallel decrease, or subset, of the Defined Text Set. This is because the process of extracting a part of the text means that extra content, even content that is illegal under the original language's syntax, becomes part of the Accept Text Set.

We can imagine an application that only looks at given names and ignores everything else. My favourite example of this is a "Baby Name" Wizard. The application might use a simple XPath expression to extract the given name from inside the name. This is a different version of the Name Language, which we will call the Given Name Language. The Accept Text set for the Given Name Language is anything, given, anything. The Defined Text set for the Given Name Language is given. The information set for the Given Name Language is given. Because the Given Name Language syntax set is more relaxed than that of the Name Language V1, an addition of the middle name between the given and family is a compatible change for the Given Name Language. There are a variety of other now acceptable names in the Given Name Language.
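A consumer of the Given Name Language might look like the following Python sketch, which uses the limited XPath support in the standard library's xml.etree.ElementTree. The function name and the sample texts are assumptions for illustration. Note that this sketch still requires well-formed XML, so its Accept Text set is narrower than the "anything, given, anything" described above.

```python
from typing import Optional
import xml.etree.ElementTree as ET


def given_name(text: str) -> Optional[str]:
    """Return the content of the first given child of name, ignoring the rest."""
    root = ET.fromstring(text)
    if root.tag != "name":
        return None
    # "given" is a path expression selecting the first direct child named given.
    return root.findtext("given")


v1_text = "<name><given>Dave</given><family>Orchard</family></name>"
# A middle inserted between given and family; the value is illustrative.
with_middle = (
    "<name><given>Dave</given><middle>B</middle>"
    "<family>Orchard</family></name>"
)
```

Both texts yield the same given name, so inserting a middle between given and family is a compatible change for this consumer even though it would not be for a consumer that validates the full Name Language V1 syntax.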

Our principles with respect to compatibility and language versioning need no change to deal with partial understanding. Partially understanding a language is the creation of a Language L1' that is compatible with Language L1. This is true if L1' Accept Text set > (superset) Language L1 Accept Text set > (superset) L1 Defined Text set > (superset) L1' Defined Text Set AND every text in L1' Defined Text set is compatible with L1 AND every text in L1' Accept Text set is compatible with L1'.

Interestingly, partially understanding a language is creating a language V1', such that the V1 language is a compatible change to V1'. There may be many different versions that are all partial understandings of a language. We call these related languages "flavours". It may be very difficult for a language designer to know how many different language flavours are in existence. However, a language designer can sometimes use the different flavours to their advantage in designing for a mixture of compatible and incompatible changes. Some changes could be compatible with some flavours but not others. It may be very useful to have some changes be compatible with some flavours, so that those consumers do not need to be updated or changed.

It is crucial to point out that the consumers of partially understood versions of the language are not also producers of the partially understood language. They have relaxed the restrictions on the consuming side, but should not do so on the production side of the language. If a flavour of a language was also used for production, it should have to create an instance that is valid according to the Language V1 rules, not the Language V1'. Perhaps the only exceptions are if they are guaranteed that they will be producing for compatible flavours. Typically this is not the case and hard to determine, so the safest course is to produce according to the Language V1 rules.

We have shown how relaxing the constraints on a language when consuming instances of it can turn an otherwise incompatible change into a compatible change. We have also shown that abiding by the language constraints when producing instances is the safest course. The Internet robustness principle from [tcp] says this more eloquently: "be conservative in what you do, be liberal in what you accept from others".

We will call this style of versioning the "liberal" style of versioning. The "liberal" style of versioning is codified in:

Good Practice

Use Least Partial Languages for "liberal" versioning: Consumers should use a flavour of a language that has the least amount of understanding.

The least amount of understanding will be the most liberal or have the largest syntax set possible.

The "liberal" style of versioning has a significant downside in that it can lead to very fragile and hard to evolve software because the "liberal"ness is difficult to code and it does not force producers to be correct in what they produce, causing a vicious cycle of complexity.

There is an opposite style of versioning that says the most effective way of evolving is to force producers to be correct by having strict consumers. We will call this the "conservative" style of versioning. The "conservative" style of versioning is codified in:

Good Practice

Use No Partial Languages for "conservative" versioning: Consumers should fully use and validate a language.

The fullest amount of understanding will find the most errors.

In either "liberal" or "conservative" versioning of consumers, the advice to a producer is the same:

Good Practice

Produce no partial languages: Producers should use the complete version of a language and no partial flavours of a language.

However, the difference between them is that "liberal" consumers might allow producers that aren't fully compatible with the Language whereas "conservative" consumers will be less fault tolerant.

EdNote: I think related to principle of least power. The least powerful the language, the easier to have partial understanding?

Compatibility is defined for the producer and consumer of an individual text. Most messaging specifications, such as Web Services, provide inputs and outputs. Using these definitions of compatibility, a Web service that updates its output message is considered a newer producer because it is sending a newer version of the message. Conversely, updating the input message makes the service a newer consumer because it is consuming a newer version of the message. All systems of inputs and outputs must consider both when making changes and determining compatibility. For full compatibility, any output message changes must be forwards compatible (for the older receivers, that is consumers) and any input message changes must be backwards compatible (for the older senders, that is producers).

1.1.3 Divergent Understanding and Compatibility

Our treatise so far has described a fairly straightforward evolution of a language, from a first version to a next version. However, extensibility and interoperability are usually directly related. It is an axiom in computing that the lower the optionality (which includes extensibility), the higher the chance of interoperability. Each and every place that extensibility is allowed in a language is also a place for a lack of interoperability. The interoperability problems can arise when producers and consumers do not agree on which version is being used in a text.

In addition to the explicit extensibility defined in a language, there is the actual extensibility and definition of a language used in an agent. One significant way that divergent understanding can happen is when the actual language definition used is different between the agents. A classic example of this is "HTML TAGSoup". Much HTML software, particularly browsers, has an Accept Text Set that is larger than the definition of HTML. For example, many situations of missing end tags are processed without generating an error. This ensures the user experience, at least in the short term, is of higher quality. However, it causes long-term problems with interoperability when the illegal texts are copied by mechanisms such as "view source". The reason is that the more undocumented strings there are in an Accept Text Set, the more difficult it is to achieve interoperability. The more liberal an agent is in accepting texts, increasing the Accept Text Set by expanding the definition of the language, the more difficult interoperability is, because not every agent may have the same Accept Text Set.

On the other extreme is XML. XML allows almost no extensibility in its constructs. Name characters, tag closures, attribute quoting and allowed attribute values are all very fixed. This has increased interoperability between implementations of XML. However, it has also made it very difficult to move to XML 1.1 because, given the lack of extensibility, almost all changes are incompatible. The XML language design was very specifically trying to avoid the "HTML TAGSoup" problem, and it has arguably done that, at the cost of an inability to version. These two extremes of extensibility design exist because of well-thought design. The trade-off between extensibility, interoperability and the Accept Set was planned in advance. Language designers should do the same with their languages.

Good Practice

Analyze Trade-offs for Language: Language designers should analyze the trade-offs between extensibility, interoperability, and actual language Accept Set.

1.1.4 Open or Closed systems

The cost of changes that are not backward or forward compatible is often very high. All the software that uses the language must be updated to the newer version. The magnitude of that cost is directly related to whether the system in question is open or closed.

[Definition: A closed system is one in which all of the producers and consumers are more-or-less tightly connected and under the control of a single organization.] Closed systems can often provide integrity constraints across the entire system. A traditional database is a good example of a closed system: all of the database schemas are known at once, all of the tables are known to conform to the appropriate schema, and all of the elements in each row are known to be valid for the schema to which the table conforms.

From a versioning perspective, it might be practical in a closed system to say that a new version of a particular language is being introduced into the system at such and such a time and all of the data that conforms to the previous version of the schema will be migrated to the new schema.

[Definition: An open system is one in which some producers and consumers are loosely connected or are not controlled by the same organization. The internet is a good example of an open system.]

In an open system, it's simply not practical to handle language evolution with universal, simultaneous, atomic upgrades to all of the affected software components. Existing producers and receivers outside the immediate control of the organization that has published a changed language will continue to use the previous version for some (possibly long) period of time.

Finally, it's important to remember that systems evolve over time and have different requirements at different stages in their life cycle. During development, when the first version of a language is under active development, it may be valuable to pursue a much more aggressive, draconian versioning strategy. After a system is in production and there is an expectation of stability in the language, it may be necessary to proceed with more caution. Being prepared to move forward in a backwards and forwards compatible manner is the strongest argument for worrying about versioning at the very beginning of a project.

1.1.5 Compatibility of languages vs compatibility of applications

From NoahM: The draft is on pretty firm ground when it talks about the information that can be determined from a given input text per some particular language L. I think there are important compatibility statements we can and should make at just that level (see suggestions above), and we should separate them from statements about the compatibility of a particular pair of applications that may communicate using the language. Both are important to include, I think, but they should be in separate chapters, one building on the other. Once you've cleanly told a story about which information can be reliably communicated when sender and receiver interpret using different language versions, you can go on to tell a separate story about whether the applications can indeed work well together. To illustrate what I mean, here are examples at each of the two levels.

Language level incompatibility: Consider a situation in which the same input connotes different information in one version of a language or another. Without reference to any particular application, we can say that the languages are in that respect incompatible. For example, we might imagine a version of a language in which array indexing is 1-based, and a later version in which 0-based indexing is used; the information conveyed by any particular array reference is clearly in some sense incompatible, regardless of the consuming application's needs.
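The array indexing example can be made concrete with a small Python sketch. The two reader functions below are hypothetical stand-ins for the two language versions; they are not drawn from any real language. The same text (the index 1) conveys different information under each version, independent of what any application does with the result.

```python
def read_v1(items, index):
    # Hypothetical earlier version of the language: indexing is 1-based.
    return items[index - 1]


def read_v2(items, index):
    # Hypothetical later version of the language: indexing is 0-based.
    return items[index]


items = ["red", "green", "blue"]

# The same array reference yields different information in each version.
first_reading = read_v1(items, 1)
second_reading = read_v2(items, 1)
```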

Application-level incompatibility: Now consider two applications designed to render the same version of the HTML language. The same tags are supported, with the same layout semantics, etc. One of the applications, however, has a sub-optimal design. Its layout engine has overhead that grows geometrically with the number of layout elements. If you give it a table with 50 rows, it takes 3 seconds to run on some processor. If you give it a table with 5000 rows it runs for 3 days. Question: is the second application "compatible" with the 5000 row input? In some ways yes, and in some no. It will eventually produce the correct output, but in practice a user would consider it incompatible. This illustrates that compatibility of applications ultimately has to be documented in terms meaningful to the applications. In this case, rendering time is an issue. I think we should not try in this finding to document specific levels of compatibility at the application level and we should especially not fall into the trap of trying to claim it's a Boolean compatible/incompatible relation; in the performance example, it's a matter of degree. So, the terminology needs to be specific to the application and its domain. I do think we can talk about some meta-mechanisms that work at the application level, such as mustUnderstand, but they should be in a section that's separate from the exposition of texts, information, and the degree to which information may be safely extracted from a given text when sender and receiver operate under differing specifications.

The current draft tries to take the approach that we will model application compatibility by defining a new language that is the flavor of (in this case HTML) that a particular consumer will successfully process, but the point is that "success" is sometimes a fuzzy concept. Do we have two languages for this example, one for the documents that completely break application #2 and another for those that just make it run slowly? That seems to be what the finding is doing today, and I'm not convinced it's the right approach. My proposal would be that we just point out the distinction and say: "This first section of the finding for the most part restricts its analysis to the limited question of: what information can be reliably conveyed when a producer and a consumer operate using different versions of what purport to be the same or similar languages? The later sections explore some techniques that can be used by applications to negotiate means of safe interoperation when sender and receiver are written to differing versions of a language specification."

1.2 Why Worry About Extensibility and Versioning?

As texts, or messages, are exchanged between applications, they are processed. Most applications are designed to discriminate between valid and invalid inputs. In order to have any sort of interoperability, a language must be defined or described in some normative way so that the terms "invalid" and "valid" have meaning.

There are a variety of tools that might be employed for this purpose (DTDs, W3C XML Schema, RELAX NG, Schematron, etc.). These tools might be augmented with normative prose documentation or even some application-specific validation logic. In many cases, the schema language is the only validation logic that is available.

It is almost unheard of for a single version of a language to be deployed without requiring some kind of augmentation. Invariably, the original language designer did not include certain terms and constraints. In fact, good designers should not try to define all the possible terms and constraints. This is sometimes called "boiling the ocean". Knowing that a language will not be all things to all people, a language designer can allow parties to extend instances of the language or the language itself. Typically the tools will allow the language designer to specify where extensions in the instance and extensions in the language are allowed. Of note, we do not call extending a text of a language a new version. This limits our discussion of versioning to changes in a language, not changes to instances.

Whether you've deployed ten resources, or a hundred, or a million, if you change a language in such a way that all those resources will consider instances of the new language invalid, you've introduced a versioning problem with real costs.

Once a language is used outside of its development environment, there will be some cost associated with changing it: software, user expectations, and documentation may have to be updated to accommodate the change. Once a language is used in environments outside of a single realm of control, any changes made will introduce multiple versions of the language.

1.3 Why Do Languages Change?

There are many reasons why a different version of a language may be needed. A few of them include:

1.      Bugs may need to be fixed. Production use may reveal defects or oversights that need to be fixed. This may involve changes to components of the language or changes to the semantics of existing components.

2.      Changing requirements may motivate changes in the schema design. For example, a callback may be added to a service that performs some processing so that it is able to notify the caller when processing has completed.

3.      Different flavors of a schema may be desirable. For example, the XHTML 1.0 Recommendation defines strict, transitional, and frameset schemas. All three of those schemas purport to define the same namespace, but they describe very different languages.

And additional schemas may be defined by other specifications, such as the XHTML Basic Recommendation.

Whatever the cause, over time, different versions of the language exist and designing applications to deal with this change in a predictable, useful way requires a versioning strategy.

1.4 How Do Languages Change?

At the most basic level, languages can change in only a few ways:

·         Content: The allowable content can evolve through addition or deletion. In XML, this becomes

o        Elements: New elements can be added, existing elements can be removed, or the acceptable number of occurrences of an element can change. In addition, the content of an element could change from element only content to mixed content, or vice versa.

For elements with simple content, the type or range of values that are acceptable can change.

o        Attributes: New attributes can be added, existing attributes can be removed, or the type or range of values that are acceptable can change.

·         Semantics: The meaning of an existing term can change.

Of course, the difference between two versions of a language can be an arbitrary number of these changes.

One of the most important aspects of a change is whether or not it is backwards or forwards compatible.

Some typical backwards- and forwards-compatible changes:

·         adding optional components (elements and/or attributes)

·         adding optional content, for example extending an enumeration

Some typical forwards-compatible changes:

·         Decreasing the maximum allowed number of occurrences of a component but not to less than the minimum

·         Decreasing the allowed range of a component

Some typical backwards-compatible changes:

·         Increasing the maximum allowed number of occurrences of a component

·         Increasing the allowed range of a component

Some typical incompatible changes:

·         changing the meaning or semantics of existing components

·         adding required components

·         removing required components

·         restricting a component's content model, such as changing a choice to a sequence
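The occurrence-count cases listed above can be checked mechanically if, as a simplifying assumption, a language version is modeled by nothing more than the allowed number of occurrences of a single component. The following Python sketch is illustrative only; OccurrenceRange and the two checking functions are hypothetical names, not part of this Finding. A change is treated as backwards compatible when a consumer of the new version accepts every text of the old version, and as forwards compatible when a consumer of the old version accepts every text of the new version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OccurrenceRange:
    """Allowed number of occurrences of one component in a language version."""
    minimum: int
    maximum: int

    def accepts(self, count: int) -> bool:
        return self.minimum <= count <= self.maximum

    def counts(self) -> range:
        # Every occurrence count that a valid text of this version may use.
        return range(self.minimum, self.maximum + 1)


def is_backwards_compatible(old: OccurrenceRange, new: OccurrenceRange) -> bool:
    # A consumer of the new version must accept every text of the old version.
    return all(new.accepts(n) for n in old.counts())


def is_forwards_compatible(old: OccurrenceRange, new: OccurrenceRange) -> bool:
    # A consumer of the old version must accept every text of the new version.
    return all(old.accepts(n) for n in new.counts())


v1 = OccurrenceRange(minimum=1, maximum=3)
raised_max = OccurrenceRange(minimum=1, maximum=5)   # maximum increased
lowered_max = OccurrenceRange(minimum=1, maximum=2)  # maximum decreased
raised_min = OccurrenceRange(minimum=2, maximum=5)   # an occurrence became required
```

Under this model, raising the maximum is backwards compatible only, lowering it is forwards compatible only, and raising the minimum while also raising the maximum is compatible in neither direction.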

1.4.1 Why Extend languages?

The primary motivation to allow instances of a language to be extended is to decentralize the task of designing, maintaining, and implementing extensions. It allows producers to change the instances without going through a centralized authority. It means that changes can occur at the producer or consumer without the language owner approving of them. Consider the effort that the HTML Working Group put into modularity of HTML. Without some decentralized process for extension, every single variant of HTML would have to be called something else or the HTML Working Group would have to agree to include it in the next revision of HTML.

1.5 Kinds of Languages

Ultimately, there are different kinds of languages. The versioning approaches and strategies that are appropriate for one kind of language may not be appropriate for another. Among the various kinds of vocabularies, we find:

·         Just Names: some languages don't actually have a syntax or grammar; they're just lists of names. Using names to identify words in the WordNet database, for example, or the names of functions and operators in XPath2 are examples of "just name" languages.

·         Markup: SGML, HTML, and the non-SGML variants of HTML are all markup languages. XML languages are described in Versioning - XML.

·         Non-markup Text: languages designed with a text format. These may be programming languages such as Java or ECMAScript, or data formats like CSS or Comma Separated Values. Typically these are intended for humans to author and/or view.

·         Binary: languages that are not in a text format. These may be image formats like GIF, JPEG, or even binary encoded XML.

This is by no means an exhaustive list. Nor are these categories exclusive. Many languages have mixed modes. For example, XQuery has a text mode and an XML mode.

2 Versioning Strategies

Versioning is a broad and complex issue. Different communities have different notions about what constitutes a version, what constitutes a reasonable policy, and what the appropriate behavior is in the face of deviations from that policy. Historically, it has always proved more complicated in practice than in theory.

In broad terms, the approaches to versioning fall into a number of classes ranging from "none" to a "big bang":

·         None. No distinction is made between versions of the language. Applications are either expected not to care, or they are expected to cope with any version they encounter.

·         Compatible. Designers are expected to limit changes to those that are either backwards or forwards compatible, or both.

o        Backwards compatibility. Applications are expected to behave properly if they receive a text of the "older" version of a language. Backwards compatible changes allow applications to behave properly if they receive a text of the "older" version of the language.

o        Forwards compatibility. Applications are expected to behave properly if they receive a text of the "newer" version of a language. Forwards compatible changes allow existing applications to behave properly if they receive a text of the "newer" version of the language.

·         Flavors. Applications are expected to behave properly if they receive one of a set of flavors of the text type.

·         Big bang. Applications are expected to abort if they see an unexpected version.

There's no single approach that's always correct. Different application domains will choose different approaches. But by the same token, the approaches that are available depend on other choices, especially with respect to namespaces. This dependency makes it imperative to plan for versioning from the start. If you don't plan for versioning from the start, when you do decide to adopt a plan for versioning, you may be constrained in the available approaches by decisions that you've already made.

A language commonly goes through a lifecycle of iterative development followed by deployment followed by deployment of new versions. The point in the lifecycle will affect the selection of the versioning strategy for the language.

Just as there are a number of approaches, there are a number of strategies for implementing an approach. The internet, including MIME, markup languages, and XML languages, has successfully used various strategies, either singly or in combination. Summaries of strategies and requirements have been produced for earlier technologies and guided XML Namespaces and Schema, such as [Web Architecture: Extensible Languages].

2.1 Versioning Designs

For any given strategy, there are various designs that achieve the strategy, and some may be more appropriate than others. Among them we find:

2.1.1 Big Bang/Incompatible

As desirable as compatible evolution often is, sometimes a language may not want to allow it. In this model, a consumer will generate a fault if it finds a component it doesn’t understand. An example might be a security specification where a consumer must understand each and every extension. This suffers from the significant drawback that it does not allow compatible changes to occur in the language, as any changes require both consumer and producer to change. A few different designs for incompatible evolution are:

·         Must Understand All: consumers must understand all of the text received and are expected to abort processing if they do not.

·         Must Understand Listed: consumers must understand the text portions that are listed in some well-defined place in the text.

·         Must Understand Marked: consumers must understand the text portions that are marked as being mandatory. SOAP header blocks with the mustUnderstand attribute are an example.

2.1.2 Forwards Compatible

Forwards compatible means that producers should be able to extend existing texts with new texts without consumers having to change existing implementations. Extensibility is one step towards this goal, but achieving forwards compatibility also requires a processing model for the extensions. The behavior of software when it encounters an extension should be clear. For this, we introduce the next rule:

Good Practice

Provide Processing Model Rule: Languages SHOULD specify a processing model for dealing with extensions.

Achieving forwards-compatible evolution requires that the processing model be a substitution mechanism. The instance containing the extension, which isn't known by the consumer, must be transformed into an instance which is of a type known by the consumer.

Good Practice

Provide Substitution model: Languages MUST provide a substitution model for forwards-compatible evolution.

2.1.2.1 Must Ignore Unknowns

Perhaps the simplest substitution model that enables forwards-compatible changes is to ignore content that is not understood. This rule is:

Good Practice

Must Ignore Unknowns Rule: Consumers MUST ignore any text portion that they do not recognize.

This rule does not require that the elements be physically removed; only ignored for processing purposes. There is a great deal of historic usage of the Must Ignore rule. HTML 1, 2 and 3.2 follow the Must Ignore rule as they specify that any unknown start tags or end tags are mapped to nothing during tokenization. HTTP 1.1 [7] specifies that a consumer should ignore any headers it doesn't understand: "Unrecognized header fields SHOULD be ignored by the recipient and MUST be forwarded by transparent proxies." The Must Ignore rule for XML was first standardized in the WebDAV specification RFC 2518 [6] section 14 and later separately published as the Flexible XML Processing Profile [3].

Sometimes the must understand and must ignore approaches can be combined for more selective use. SOAP processors must ignore headers they do not recognize unless the header explicitly identifies itself as one that must be understood.
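The combination of Must Ignore and Must Understand described above can be sketched as follows. This is a minimal illustration in Python and does not use any SOAP library; header blocks are modeled as plain dictionaries, and the keys "name" and "must_understand", the exception MustUnderstandFault and the function process_headers are all hypothetical names chosen for the example.

```python
class MustUnderstandFault(Exception):
    """Raised when a mandatory header block is not recognized."""


def process_headers(headers, known_names):
    """Return the header blocks this consumer will act on.

    headers: list of dicts with the hypothetical keys "name" and
        "must_understand" (the latter defaults to False when absent).
    known_names: set of header names the consumer recognizes.
    """
    accepted = []
    for header in headers:
        if header["name"] in known_names:
            accepted.append(header)
        elif header.get("must_understand", False):
            # Must Understand: an unrecognized mandatory block is a fault.
            raise MustUnderstandFault(header["name"])
        # Otherwise Must Ignore: the unrecognized block is skipped.
    return accepted


known = {"routing"}
optional_extension = [{"name": "routing"}, {"name": "priority"}]
mandatory_extension = [{"name": "signature", "must_understand": True}]
```

With these inputs, the unrecognized "priority" block is ignored, while the unrecognized "signature" block causes the whole message to be rejected.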

There are two broad types of Must Ignore rules for dealing with extensions, either ignoring the entire tree or just the unknown part of the tree. The rule for ignoring the entire tree is:

Good Practice

Must Ignore All Rule: The Must Ignore rule applies to unrecognized texts and their descendants in tree based formats.

This variation on must ignore requires the consumer to ignore the text and any children it does not understand. Most data applications, such as Web services that use SOAP header blocks or WSDL extensions, adopt this approach to dealing with unexpected markup. For XML, the Must Ignore all rule was first standardized in the WebDAV specification RFC 2518 [WebDAV] section 14 and later separately published as the [FlexXMLP].

For example, if a message is received with unrecognized elements in a SOAP header block, they must be ignored unless marked as "Must Understand" (see Rule 10 below). Note that this rule is not broken if the unrecognized elements are written to a log file. That is, "ignored" doesn’t mean that unrecognized extensions can’t be processed; only that they can’t be the grounds for failure to process.

Other applications may need a different rule as the application may want to retain the content of an unknown element, perhaps for display purposes. The rule for ignoring the element only is:

Good Practice

Must Ignore Container Rule: The Must Ignore rule applies only to the smallest portion of the tree.

This variation on must ignore requires the consumer to ignore the smallest part of the text that is ignorable. For markup languages, this could be just an element or attribute that the consumer does not understand; in the case of an element, the consumer still processes the children of that element. The Must Ignore Container practice was described in [HTML 2.0].

This retains the element descendants in the processing model so that they can still affect interpretation of the text, such as for display purposes.
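The difference between the Must Ignore All and Must Ignore Container variants can be illustrated on a small XML tree. The Python sketch below uses the standard xml.etree.ElementTree module; the vocabulary in KNOWN, the unrecognized "blink" element and the function names are assumptions made for the example, and the root element is assumed to be recognized.

```python
import xml.etree.ElementTree as ET

KNOWN = frozenset({"p", "b"})  # hypothetical vocabulary the consumer recognizes


def _append_text(parent, text):
    """Attach character data at the current end of parent's content."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def ignore_all(element, known=KNOWN):
    """Must Ignore All: drop each unknown element together with its descendants."""
    copy = ET.Element(element.tag, dict(element.attrib))
    copy.text = element.text
    for child in element:
        if child.tag in known:
            copy.append(ignore_all(child, known))
        # Text following the child belongs to the parent and is kept.
        _append_text(copy, child.tail)
    return copy


def ignore_container(element, known=KNOWN):
    """Must Ignore Container: drop only the unknown tag, keep what it contains."""
    copy = ET.Element(element.tag, dict(element.attrib))
    copy.text = element.text
    for child in element:
        kept = ignore_container(child, known)
        if child.tag in known:
            copy.append(kept)
        else:
            # Promote the content of the unknown element into its parent.
            _append_text(copy, kept.text)
            for grandchild in list(kept):
                copy.append(grandchild)
        _append_text(copy, child.tail)
    return copy


root = ET.fromstring("<p>Hello <blink>big <b>bold</b> world</blink>!</p>")
without_subtree = ET.tostring(ignore_all(root), encoding="unicode")
without_container = ET.tostring(ignore_container(root), encoding="unicode")
```

Here ignore_all discards everything inside the unrecognized element, whereas ignore_container keeps the recognized "b" element and the surrounding character data, which is the behavior a display-oriented consumer would typically want.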

Ignoring content is a simple solution to the problem of substitution. In order to achieve a compatible evolution, the newer texts of a language must be transformable (or substitutable) into older texts. Object systems typically call this "polymorphism", where a new type can behave as the old type.

2.1.2.2 Fallback Provided

A language can provide mechanisms for explicit fallback if the text is not supported. [MIME] provides multipart/alternative for equivalent, and hence fallback, representations of content. [HTML 4.0] uses this approach in the NOFRAMES element. In XML, the XML Inclusions specification [XInclude] provides a fallback element to handle the case where the putatively included resource cannot be retrieved. There are many variations on where the fallback content can be found. For example, a schema language could specify that fallback content is found in a text, in a schema, or even in the schema for the schema language.
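A fallback selection in the style of multipart/alternative can be sketched as follows. The Python example does not parse real MIME messages; it assumes the alternatives are already available as a list of (content type, body) pairs ordered from the plainest representation to the richest, which is the ordering MIME uses for multipart/alternative. The function name choose_alternative and the content type "application/x-hypothetical-rich" are invented for illustration.

```python
def choose_alternative(parts, supported_types):
    """Pick the representation to present from a list of equivalent ones.

    parts: list of (content_type, body) pairs, ordered from the plainest
        representation to the richest.
    supported_types: set of content types this consumer can process.
    Returns the last supported pair, or None if nothing is supported.
    """
    chosen = None
    for content_type, body in parts:
        if content_type in supported_types:
            chosen = (content_type, body)  # a later supported part wins
    return chosen


message = [
    ("text/plain", "Hello"),
    ("text/html", "<p>Hello</p>"),
    ("application/x-hypothetical-rich", "..."),
]
```

A consumer that supports only text/plain falls back to the first part, while one that also supports text/html selects the richer second part; neither needs to understand the third.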

2.1.2.3 Supporting functionality

Additional functionality can be provided in a language for determining the capabilities of the system that the text is being interpreted in. A language can provide a mechanism for explicit testing. The XSLT Specification provides a conditional logic element and a function to test for the existence of extension functions. This allows designers of stylesheets to deal with different consumer capabilities in an explicit fashion.

2.1.3 Backwards compatible

In general, providing backwards compatibility is easier than providing forwards compatibility. Backwards compatibility means supporting the previous versions of text in a newer consumer. There are two significant ways that backward compatibility can be supported.

2.1.3.1 Replacement

In the replacement design, the new version of software replaces the old and the new version of the software supports the old and the new version. That is, the producer does not need to distinguish between the old and the newer consumer. For example, a web resource that supports additional Name Information as input does not change the URI of the resource.

2.1.3.2 Side-by-side

In the side-by-side design, the new version of the software and the old version of the software are deployed "side-by-side". One variant is offering both versions of the system, for example by using different URIs for the old and new resources. The request to one resource gets mapped to the other resource behind the scenes using a proxy or gateway. This "alternative" approach works when the intermediary can completely handle or generate the new information (for backwards compatibility) or ignore the new information (for forwards compatibility). For example, adding SSL security to a resource changes the URI but a Web server can typically handle mapping the https: URI to the older http: URI. If both URIs are maintained, then the addition is a compatible change. Another example is where new information is required, such as the priority, and the intermediary can apply a default value to provide the required priority. However, this too has its costs as multiple versions of the software must be supported and maintained over time and there is the added cost of developing the proxy or gateway between the two environments. Further, this does not work in scenarios where the intermediary cannot generate the new required content. For example, if a middle name is required in V2, a middle name cannot be generated from just a family and a given name.
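The role of the intermediary can be sketched in Python. Messages are modeled as dictionaries; the field names, the default priority value "normal" and the names upgrade_to_v2, downgrade_to_v1 and CannotUpgrade are assumptions made for this example. The sketch shows the two cases discussed above: a required field for which a default exists, and a required field that cannot be generated.

```python
class CannotUpgrade(Exception):
    """Raised when required V2 content cannot be derived from a V1 message."""


V1_FIELDS = ("given_name", "family_name")
V2_DEFAULTS = {"priority": "normal"}  # hypothetical default applied by the gateway


def upgrade_to_v2(v1_message, required, defaults=V2_DEFAULTS):
    """Map a V1 message to V2, filling required fields from defaults."""
    v2_message = dict(v1_message)
    for field in required:
        if field not in v2_message:
            if field not in defaults:
                # The intermediary has no way to generate this content.
                raise CannotUpgrade(field)
            v2_message[field] = defaults[field]
    return v2_message


def downgrade_to_v1(v2_message, v1_fields=V1_FIELDS):
    """Map a V2 message to V1 by ignoring the new information."""
    return {k: v for k, v in v2_message.items() if k in v1_fields}


old_message = {"given_name": "Ada", "family_name": "Lovelace"}
with_priority = upgrade_to_v2(old_message, required=V1_FIELDS + ("priority",))
```

Requiring "middle_name" instead of "priority" makes upgrade_to_v2 raise CannotUpgrade, since no default can stand in for a person's middle name.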

2.1.4 Mixtures

Languages can choose a mixture of approaches. For example, XSLT provides both an explicit fallback mechanism for some conditions and explicit testing for others. The SOAP specification, another example, specifies Must Ignore as the default strategy and the ability to dynamically mark components as being in the Must Understand strategy.

2.2 Why Have a Strategy?

Different kinds of languages and different versioning strategies expose different problems. If you don't have a strategy at all, you are effectively choosing the "no versioning" strategy.

It's probably obvious that attempting to deploy a system that provides no versioning mechanism is fraught with peril. Putting the burden of version "discovery" on consumers is probably impractical in anything except a closed system.

At the other end of the spectrum is the "big bang" approach which is also problematic.

"Big bang" is a very coarse-grained approach to versioning. It establishes a single version identifier, either a version number or namespace name, for an entire text.

The semantics of the "big bang" are that applications decide on the basis of the text version whether or not they know how to process that text. If the version isn't recognized, the entire text is rejected. Typically, when introducing a new version using the big bang approach, all of the software that produces or consumes the texts is updated in a sweeping overhaul in which the entire system is brought down, the new software deployed and the system is restarted. This big bang approach to versioning is practical only in circumstances where there is a single controlling authority, and even in that case, it carries with it all manner of problems. The process can take a considerable amount of time, leaving the system out of commission for hours if not days. This can result in significant losses if the system is a key component of a revenue generating business process, and coordinating the system overhaul can itself be quite costly.

The "big bang" approach is appropriate when the new version is radically different from its predecessor. But in many cases, the changes are incremental and often a consumer could, in practice, cope with the new version. For example, it might be that there are many messages that don't use any features of the new version or perhaps it is appropriate to simply ignore elements that are not recognized.

For example, consider two resources exchanging messages. Imagine that some future version of the language that they are using defines a new "priority" element. Because producers and consumers are distributed, it may happen that an old consumer, one unprepared for a priority element, encounters a message sent by a newer producer.

If big bang versioning is used, old systems will reject the new message. However, if the versioning strategy allowed the old consumer to simply ignore unrecognized content, it's quite possible that other components of the system could simply adapt to the previous behavior. In effect, the old system would ignore the priority element and its descendants so it would "see" a message that looks just like the old format it is expecting.
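The contrast between the two strategies for the priority example can be sketched in Python. The message layout (a "version" identifier plus a "body" dictionary), the field names and both consumer functions are hypothetical and serve only to show how an old consumer behaves under each strategy.

```python
V1_FIELDS = frozenset({"item", "quantity"})  # fields known to the old consumer


def consume_big_bang(message, supported_version="1"):
    # Big bang: reject the entire text when the version is not recognized.
    if message["version"] != supported_version:
        raise ValueError("unsupported version: " + message["version"])
    return message["body"]


def consume_must_ignore(message, known_fields=V1_FIELDS):
    # Must Ignore: drop unrecognized content and process the rest, so the
    # consumer "sees" a message that looks like the old format.
    return {k: v for k, v in message["body"].items() if k in known_fields}


new_message = {
    "version": "2",
    "body": {"item": "pen", "quantity": 3, "priority": "high"},
}
seen_by_old_consumer = consume_must_ignore(new_message)
```

Under the big bang strategy the same message is rejected outright, even though the old consumer could have fulfilled the request without the priority information.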

For the producer, the result would be that the request is fulfilled, though perhaps in a more or less timely fashion than expected. In many cases this may be better behavior than receiving an error. In particular, producers using the new format can be written to cope with the possibility that they will be speaking to old consumers.

If the new system needs to make sure that priority is respected, then it can change the purchase order's name or namespace to indicate that the new behavior is not considered backwards compatible.

Often, what is needed is some sort of middle ground solution. An evolving system should be designed with backwards and forwards compatibility in mind.

3 Language Requirements

Given the types of versioning strategies and designs that are available, there are some key requirements the language designer must consider in choosing a strategy and design.

3.1 What language form

Languages can be expressed in text, comma separated values, XML, SGML, binary, source code, and almost any kind of form. See the Architecture of the World-Wide Web section on data formats for more information - http://www.w3.org/TR/2004/REC-webarch-20041215/#formats.

3.2 Can 3rd parties extend the language?

It is sometimes desirable to prevent 3rd parties from extending languages, but it does happen. An example may be a tightly constrained security environment where distributed authoring is considered a "bug" rather than a feature.

3.3 Can 3rd parties extend the language in a compatible way?

If so, a substitution mechanism is required for forwards compatibility. If an older consumer has no mechanism for dealing with new content, then forwards-compatible evolution isn't possible. One simple substitution mechanism is simply ignoring the unrecognized components.

3.4 Can 3rd parties extend the language in an incompatible way?

If so, and if compatible extensions are also possible, then it must be possible to identify incompatible changes so that they can override the substitution mechanism used for extensible changes.

In environments where unrecognized components are ignored, a "must understand" component can be added to identify incompatible changes.

If compatible changes are not possible, then incompatible changes simply become the default. For example, WS-Security mandates that 3rd parties can only provide incompatible extensions. Unlike most languages, a security language has unique requirements where the consequences of ignored data can be severe. WS-Security accomplished this by specifying that all extensions are required to be understood and there is no substitution mechanism.

3.5 Can the designer extend the language in a compatible way?

As with 3rd party compatible extensions, a substitution mechanism for the designer’s extensions is required for forwards compatibility.

3.6 Can the designer extend the language in an incompatible way?

In XML, the designer can always do this by using new namespace names, element names or version numbers. In other languages, this may not be possible because there is no mechanism for indicating the incompatible change.

3.7 Is the vocabulary a stand-alone language or an extension of another vocabulary?

A part of this question is whether the language depends on another language. That determines which, if any, facilities are provided for the containing language and what must be provided by a contained language.

SOAP is an example of a container language. The SOAP processing model applies uniformly to all headers, which may employ soap:mustUnderstand to identify incompatible changes, even though the contents of the SOAP headers are languages independent from SOAP.

3.8 What Schema language(s)?

Choosing a schema language or languages guides the language design in many ways. Some features, particularly extensibility, must be anticipated in the first version of a language in order to take advantage of the features of some schema languages.

In addition, various features may be incompatible across different languages. For example, writing a V2 compatible schema in W3C XML Schema requires special design, which is not required in a schema language such as RELAX NG. Some of the language design choices mandated by W3C XML Schema are discussed in other sections of this Finding.

3.9 Should extensions or versions be expressible in the Schema language?

The ability to write a schema for extensions or versions is directly affected by the schema design and the compatibility desires.

3.10 Requirements Summary

Every language design will make decisions about these requirements. These requirements can be expressed in a table form:

Requirement

Language form
Schema Lang
3rd party compatibly extend
3rd party incompatibly extend
Designer incompatibly extend
stand-alone

4 Language Design

Upon answering these questions, there are some key decisions that a language developer makes, whether they are consciously made or not.

4.1 Schema language design choices or constraints.

If the language is intended to be capable of compatible extensibility, then a few specific schema design choices must be followed.

4.2 Substitution Mechanism.

Forwards compatibility can only be achieved by providing a substitution mechanism that maps Version 2 instances, or Version 1 instances with extensions, to Version 1 instances without knowledge of V2. A V1 consumer must be able to transform any instance, such as V1 + extensions, into a V1 instance in order to process it. The "Must Ignore Unknowns" rule is a simple substitution mechanism. This rule says that any extensions are "ignored". Using it, a V1 + extensions text is transformed into a V1 text by ignoring the extensions. Other substitution mechanisms exist, such as the fallback model in XSLT.

4.3 Component identification

The identification of components into language versions or extensions has a variety of general mechanisms related to namespaces. These are detailed in the Versioning section.

4.4 Identification of incompatible extensions

The identification of versions is covered by language identification, but 3rd parties cannot arbitrarily change versions or change namespaces. They may need a mechanism to indicate that an extension is an incompatible change. A couple of mechanisms are a "Must Understand" identifier (such as a flag or list of required namespaces) or requiring that extensions are in substitution groups.

4.5 Design Summary

Every language design will make a decision in these areas. These designs can also be expressed in a table form:

Design
Schema design
Substitution Mechanism
Component Identification
Incompatible Ext identification

5 Identifying and Extending Languages

Designing extensibility into languages typically results in systems that are more loosely coupled. Extensibility allows authors to change instances without going through a centralized authority, and may allow the centralized authority greater opportunities for versioning. The common characteristic of a compatible change is the use of extensibility.

A prime example of the benefits of extensibility is HTML. The first version of HTML was designed for extensibility: it stated that "unknown markup" may be encountered. The addition of the IMG tag by the Mosaic browser team shows this design in action.

The first rule introduced in this Finding relating to extensibility is:

Good Practice

Compatible Versioning rule: Any Language intended for compatible versioning MUST have extensibility.

A fundamental requirement for extensibility and versioning is to be able to determine the language of Texts and sub-texts. Any language that does not allow identification of the language will probably have a more difficult time being versioned.

Good Practice

Language Identification rule: Any Language intended for versioning SHOULD have a version identification strategy.

5.1 Version Numbers

Having multiple versions naturally leads to the need to identify versions. Version identification has traditionally been done with a decimal point separating the major version from the minor version, e.g. "8.1", "1.0". Often the definition of a "major" change is that it is incompatible, and the definition of a "minor" change is that it is forwards- and/or backwards-compatible. Usually the first broadly available version starts at "1.0". A compatible version change from 1.0 might be identified as "1.1" and an incompatible change as "2.0".
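
The major/minor convention can be written down as a small check. The sketch below encodes only the convention as described; the function names are hypothetical, and nothing guarantees that a given language actually follows the convention.

```python
def parse_version(identifier):
    """Split a "major.minor" identifier such as "1.1" into two integers."""
    major, minor = identifier.split(".")
    return int(major), int(minor)


def classify_change(old, new):
    """Classify a change by the conventional major/minor reading."""
    old_major, old_minor = parse_version(old)
    new_major, new_minor = parse_version(new)
    if new_major != old_major:
        return "incompatible"  # a "major" change
    if new_minor != old_minor:
        return "compatible"  # a "minor" change
    return "same version"


print(classify_change("1.0", "1.1"))  # prints compatible
print(classify_change("1.0", "2.0"))  # prints incompatible
```

A consumer can use such a check to decide whether to attempt processing, but the answer is only as reliable as the language designer's adherence to the convention.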

The version numbers can be contained in the texts, in the protocol messages containing the texts, or in the address for the protocol messages. Some examples are shown below:

Example 3: Name examples.

<name version="2.0">

  <given>Dave</given>

  <family>Orchard</family>

</name>

 

<span class="fn20">Dave Orchard</span>

 

urn:nameschemev2:given:Dave:family:Orchard

 

<?xml version="1.1"?>

 

GET /name/123456789 HTTP/1.1

 

GET /name/v2/123456789/ HTTP/1.1

It should be noted that associating version number changes with compatibility changes may be idealistic, as there are abundant cases where this system does not hold. New major version identifiers are often aligned with product releases, or incompatible changes are identified as "minor" changes. A good example of an incompatible change identified as a minor change is XML 1.1. XML 1.0 processors cannot process all XML 1.1 documents because XML 1.1 extended XML 1.0 where XML 1.0 does not allow such extension.

Unfortunately, version numbers often wind up looking very similar to the big bang approach. In many approaches, each language is given a version identifier, almost always a number, that is incremented each time the language changes. Although it is possible to design a system with version numbers that enables both backward and forward compatibility (XSLT, for example), typically a version change is treated as if the new language is not backwards compatible with the old language.

Some efforts, such as HTTP, try to have the best of both worlds by allowing for extensibility (in HTTP's case, via headers) as well as version numbers that explicitly identify when a new version is backwards compatible with an old version.

One argument in favor of version numbers is that they allow one to determine what is a 'new version' and what is an 'old version'. But in practice this is not necessarily true. For example, RSS has 0.9x, 1.x, and 2.x versions, all being actively developed in parallel. In effect the version numbers, even though they appear to be ordered, are simply opaque identifiers. Using version numbers does not guarantee that version 1+x has any particular relationship to version 1.

Version numbers typically work best when versioning and extending a language is done in a centralized and linear manner. The makeup of each version can then be consistent and well described.

5.2 XML Namespaces

There are many cases where decentralized and non-linear versioning is desired. This desire was a large motivator for XML and for XML Namespaces. The self-describing and extensible nature of XML markup, together with XML Namespaces, provides a framework for developing languages that can evolve in a decentralized manner. XML Namespaces [XML Namespaces 1.0] provide a mechanism for associating a URI with an XML element or attribute name, thus specifying the language of the name. This also serves to prevent name collisions.
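
How a namespace URI identifies the language of each name can be seen by parsing a small document. The namespace URIs and the "middle" extension below are made up for illustration; the sketch relies on the "{uri}local" form in which Python's xml.etree.ElementTree reports qualified names.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaces: one for the Name language, one for a 3rd party extension.
DOCUMENT = """
<name xmlns="http://example.org/name/1"
      xmlns:ext="http://example.com/extensions/middle">
  <given>Dave</given>
  <ext:middle>B</ext:middle>
  <family>Orchard</family>
</name>
"""


def split_qname(tag):
    """Split ElementTree's '{uri}local' form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return None, tag


root = ET.fromstring(DOCUMENT)
for element in root.iter():
    uri, local = split_qname(element.tag)
    print(local, "->", uri)
```

Because the extension's names live in a URI that its author controls, they cannot collide with names the Name language's designer adds later, and a consumer can tell at a glance which components belong to which language.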

6 Case Studies

6.1 HTML

Requirement
Language form                     Markup
Schema Lang                       DTD with changes
3rd party compatibly extend       Yes
3rd party incompatibly extend     No
Designer incompatibly extend      Yes
stand-alone                       Yes
Schema design                     Extensible
Substitution Mechanism            Must Ignore Unknowns
Component Identification          DTD + Name
Incompatible Ext identification   None

6.2 XML

Requirement
Language form                     Markup
Schema Lang                       Simple Extended Backus-Naur Form
3rd party compatibly extend       No
3rd party incompatibly extend     No
Designer incompatibly extend      Yes
stand-alone                       Yes
Schema design                     Backus-Naur without extensibility in XML 1.0 constructs
Substitution Mechanism            None
Component Identification          Name or Qualified Name
Incompatible Ext identification   N/A

6.3 CSS

6.4 Microformats

Requirement
Language form                     text documentation
Schema Lang                       depending upon microformat
3rd party compatibly extend       Yes
3rd party incompatibly extend     No
Designer incompatibly extend      Yes
stand-alone                       No, embedded in HTML
Schema design                     text description of HTML including class attribute values
Substitution Mechanism            HTML's Must Ignore Unknown
Component Identification          string in class attribute
Incompatible Ext identification   None

7 Extension versus Versioning

Languages that are designed for decentralized extensibility, notably but not limited to XML, have the interesting situation where the distinction between an extension and a version can be quite blurred, depending upon the language designer’s choices.

Extension is typically thought of as the addition of components over space; that is, designers other than the language’s creator add components. Versioning is typically thought of as the addition of components over time, under the designer’s explicit control. In either case, a change to the language may be made in a compatible or an incompatible way. We typically distinguish the terms through the simple cases: extensions are compatible decentralized additions, and versions are compatible or incompatible centralized changes. But these distinctions break down depending upon how the language is designed.

There are a couple of scenarios that illustrate the ambiguity in these terms. Imagine that version 1.0 of a Name consists of "First" and "Last" elements. A 3rd party author extends the Name with a "middle" element in a new namespace which they control.

In scenario 1, the Name author decides to formally incorporate the middle name as an optional (and hence compatible) addition to the name, producing version 1.1 of the Name type. They do this by referring to the third party’s definition for middle names. This is typically considered a new "version" of the Name and would probably result in a new definition. If the Name author re-uses the existing names for compatible revisions, there will be no difference in a text containing middle that is of Version 1.0 or Version 1.1 type. The texts are the same, and thus the distinction between a "version" and an "extension" is meaningless for an individual text.

In scenario 2, the middle author decides that the middle name is a mandatory part of the Name type. They were provided a mechanism for indicating an incompatible change and they use it. Now an instance of Name with the middle is incompatible with version 1.0 of the Name. What "version" of the Name is this middle, and is the middle an "extension" or a "version"? It isn’t 1.0. It’s probably more accurately thought of as a version defined by the 3rd party. Again, the presence of the "extension" is actually an incompatible change.

These two examples, a 3rd party extension being added into a compatible version and a 3rd party extension resulting in an incompatible version, show that the ability to specify (in)compatibility has blurred the distinction between these two terms.

8 Conclusion

This Finding is intended to motivate language designers to plan for versioning and extensibility in their languages from the very first version. It details the downsides of ignoring versioning. To help language designers provide versioning in their languages, the Finding describes a number of questions, decisions and rules for use in language construction and extension. The main goal of the set of rules is to let language designers know their options for language design and make backwards- and forwards-compatible changes to their languages, achieving loose coupling between systems should that be desirable.

9 References

FOLDOC

Free Online Dictionary of Computing. (See http://wombat.doc.ic.ac.uk/foldoc/.)

FlexXMLP

Flexible XML Processing Profile. (See http://www.upnp.org/download/draft-goland-fxpp-01.txt.)

tcp

RFC 793, TCP (See http://www.ietf.org/rfc/rfc793.txt.)

MIME

RFC 1521, MIME. (See http://www.ietf.org/rfc/rfc1521.txt.)

HTML 2.0

RFC 1866, HTML 2.0. (See http://www.ietf.org/rfc/rfc1866.txt.)

WebDAV XMLIgnore post

Yaron Goland. XML Ignore proposed for WebDAV. (See http://lists.w3.org/Archives/Public/w3c-dist-auth/1997AprJun/0190.html.)

WebDAV

RFC 2518, WebDAV (See http://www.ietf.org/rfc/rfc2518.txt.)

HTTP

RFC 2616, HTTP (See http://www.ietf.org/rfc/rfc2616.txt.)

HTML 4.0

HTML 4.0. (See http://www.w3.org/TR/1998/REC-html40-19980424/.)

TBL Mandatory Extensions

Berners-Lee. Web Architecture: Mandatory extensions. (See http://www.w3.org/DesignIssues/Mandatory.html.)

TBL Extensible languages

Berners-Lee. Web Architecture: Extensible languages. (See http://www.w3.org/DesignIssues/Extensible.html.)

TBL Evolution

Berners-Lee. Web Architecture: Evolvability. (See http://www.w3.org/DesignIssues/Evolution.html.)

Web Architecture: Extensible Languages

Berners-Lee and Connolly, ed. Web Architecture: Extensible Languages World Wide Web Consortium, 1998. (See http://www.w3.org/TR/1998/NOTE-webarch-extlang-19980210.)

HTML Document types

Connolly, ed. HTML Document dialects World Wide Web Consortium, 1996. (See http://www.w3.org/MarkUp/WD-doctypes.)

SOAP 1.2

W3C Recommendation, SOAP 1.2 Part 1: Messaging Framework (See http://www.w3.org/TR/SOAP/.)

WSDL 1.1

W3C Note, WSDL 1.1 (See http://www.w3.org/TR/WSDL/.)

WS-Policy 1.2

W3C Note, WS-Policy 1.2 (See http://www.w3.org/Submissions/WS-Policy/.)

XML 1.0

W3C Recommendation, XML 1.0 (See http://www.w3.org/TR/REC-xml.)

XInclude

W3C Working Draft, XML Inclusions (See http://www.w3.org/TR-Xinclude.)

XML Namespaces

W3C Recommendation, XML Namespaces (See http://www.w3.org/TR/REC-xml-names.)

XML Schema Part 2

W3C Recommendation, XML Schema, Part 2 (See http://www.w3.org/TR/xmlschema-2.)

XML Schema Wildcard Test Collection

XML Schema Wildcard Test collection (See http://www.w3.org/XML/2001/05/xmlschema-test-collection/result-ms-wildcards.htm.)

XFront Schema Best Practices

XFront Schema Best Practices (See http://www.xfront.com/BestPracticesHomepage.html.)

XML.com Schema Design Patterns

Dare Obasanjo. XML.com Schema design patterns. (See http://www.xml.com/pub/a/2002/07/03/schema_design.html.)

Dave Orchard writings on Extensibility and Versioning

Dave Orchard writings on extensibility and versioning (See http://www.pacificspirit.com/Authoring/Compatibility.)

10 Acknowledgements

The author thanks Norm Walsh for many contributions as co-editor until 2005. Thanks also to the many reviewers who have contributed to the article, particularly David Bau, William Cox, Ed Dumbill, Chris Ferris, Yaron Goland, Hal Lockhart, Mark Nottingham, Jeffrey Schlimmer, Cliff Schmidt, and Norman Walsh.

 


 [skw1]Given that this document is intended to be more general and its companion more XML specific, this para seems too XML specific.

 [skw2]I’m trying to decide whether a set of texts and a set of syntactic constraints amount to extensional and intensional ways of specifying the same thing, i.e. are the syntactic constraints production rules from which the texts of the language may be generated? Maybe they aren’t… maybe they are simply a test for membership that doesn’t provide for the production of further texts.

 [skw3]It would seem appropriate to provide a definition for either “information” or “set of information”.

 [skw4]Could include regexps and BNF as nonXML centric ‘constraint’ languages.

 [skw5]I can’t quite work out the point of this sentence. I can’t determine whether it is saying that the use of whitespace and punctuation are part of the language definition or not. FWIW it seems to me that they should be, i.e. the constraints of the language define (amongst other things) the use of vocab terms (that needs a defn), whitespace and punctuation. Unless there is strong reason to retain this sentence, I suggest it be deleted.

 [skw6]Defn please…. Or is this the same as Act of Consumption? Hmmm… production or consumption are acts of interpretation – I can see that for consumption...

 [skw7]This feels like a definite thing that should have a definition. Binding to what?  Language Binding does not appear as a component of a language as listed above.

 [skw8]Is that two individual texts that just happen to overlap – or is that the entire set of texts are at least in a subset relationship if  not identical  -ie. have the same extension (enumeration of members)

 [skw9]The relationships between “Information”, “Information Set” and “Semantics” need more development. It is also not at all clear that a given text has a single fixed meaning independent of its context of use – which I think is the point of some of Pat Hayes’s earlier interventions with references to Quine.

 [skw10]It is difficult to talk about different versions of an instance. Instances are strings/texts. If the different version of an instance is different then it is a different instance. In the extensional model expressed as a set of texts then all the possible instances are there in that set and you are just picking between them. I have difficulty then in conceiving what a version of an instance is.

 [skw11]I know what I think this is trying to say – but it is awkward. “Applications receive  strings of a language which may have been produced using a version of the language that is different from the language version that the receiver was expecting”

 [skw12]What does it mean to compare a language (Name Language V1) with an instance (V2 instance)? Intuition suggests match/no match – but there’s a category error here. Maybe this means testing that the V2 instance is a member of the V1 texts.

 [skw13]This is a great example of context dependence in the definition of terms. Taken in isolation the bracketed defn appears to be defining the word “Extensible”. However, in context, it is either defining the term “Extensible Language” or defining “Extensible” only when used as a qualifier in conjunction with the word language.

 [skw14]It’s not that the processor is newer than the language it consumes (that is inevitable), it is that it is intended as a consumer of instances arising from the newer version of the language. Suggest: “A language change is backwards compatible if consumers of the revised language version can correctly process all instances of the unrevised language version.” Could possibly omit the trailing “versions” but wanted to cover the case of say HTML being considered as the language and say HTML 3.2, 4.0, 4.01 etc being HTML language versions. Could also think of HTML 3.2, 4.0, 4.01 etc as distinct languages that are in a version relation with some abstract HTML language – and in successor/predecessor relations to each other.

 [skw15]“Language backward compatibility means that a consumer of a newer language version can be rolled out in a way that systems continue to operate in the presence of producers of the older language version.” I don’t see that rolling out a consumer of the new language version could “break” a producer of the old language version – however I do see that systems built around both could fail if the language change were not backward compatible.

 [skw16]These are not versions of messages, but texts from the text sets of older and newer language versions (I think).

 [skw17]Ditto

 [skw18]Similar to previous comments: I think new/old should be attributed to the language version being produced/consumed rather than the producer/consumer themselves. That way we keep all the discussion of new/old in terms of language versions rather than processors.

 [skw19]Version of an Instance – see earlier comments.

 [skw20]Ditto

 [skw21]“A backward compatible change to a language enables consumers of the updated language to be deployed without having to update producers. A forward compatible change to a language allows producers of the updated language to be deployed without having to update consumers.”

 [skw22]Given the discussion on version identification – the use of V.N and V.N+1 sort of reinforces the notion of version numbering rather than identification. Also, I think it would be good to note somewhere that versions don’t necessarily form an orderly linear sequence – they may branch and merge as well.

 [skw23]The previous (but one) comment could serve as a replacement for this sentence and the previous sentence.

 [skw24]Forward and Backward compatible change both seem desirable – particularly since communication is typically not unidirectional. A forward and backward compatible change in a language suggests that it is impossible to make change. OTOH if the texts can be partitioned by ‘direction’ such that, for one party, changes to inbound texts are “backward compatible” and changes to outbound texts are forward compatible – there is some hope of updating peers of that kind ahead of its ‘complement’.

 [skw25]Information sounds dangerously close to ‘meaning’ here – but perhaps not :-)

 [skw26]While this makes sense… is it not the case that such a text does in fact have a mapping to information for the parts of the text that are recognised, i.e. it maps to the same information as a text without the extension? That is, Accept Texts do (often) have a mapping to Information.

 [skw27]Discussion up to this point has mostly been about version changes. In this para, focus seems to have shifted to extensions to a core language and producers/consumers that implement (or not) extensions to the core language. If that is not the case then it is at least the case that there is some base language and some revised or extended language and I am entirely confused as to which language version/extension the text sets being spoken of refer.

 

Oh… maybe I see… now we are speaking of compatibility between a consumer and a language definition – rather than compatibility of changes in the defn of a language. Here the language defn is fixed, but the consumer is implementing more or less of the texts in the accept set.

 

I think that we need to more carefully highlight when we are using compatibility as a relation between language definitions and when we are using it as a relation between a language defn and a consumer/producer agent. (or try to normalise things so that we only speak of compatibility between language defns).

 [skw28]Defined Texts only or Accept Texts?

 [skw29]This is now a compatibility relation between a text and a language… This should really be wrapped in a “For all ‘T in {L1_Texts”… then we could develop an expression about compatibility of languages (maybe).

 

Ok… I see this is just scaffolding for the next bit which does the “For all” wrapping.

 [skw30]Broadly I think this is good… these defn’s do hang on what is meant by “I1 compatible with I2” which seems to go undefined.

 [skw31]A diagram of these three relationships would be really good/helpful.

 [skw32]Is this second clause not implicit in the superset relation and therefore redundant?

 [skw33]Likewise… though this may hinge on the T->Language compatibility entailing information->information compatibility.

 [skw34]Have we really? I don’t think we have defined what extensibility is carefully enough to make that claim.

 [skw35]Oh that is just too simplistic. Taken literally that says that a text of the Name Language is a text of the PO language – whereas it is more of a convolution where texts from the Name language are embedded within the PO language.

“The forward, backwards and full compatibility definitions account for composition of languages because….” I think that claim needs demonstrating – I also don’t know what “account for” means.

 [skw36]Is that all possible texts of a language?

 [skw37]Don’t know if I have this right… but leaving an open extensibility point creates members of the accept text set. Defining an extension moves some members of the accept set into the defined set and some disappear entirely – the defined set grows and the accept set shrinks (remembering that the defined set is a subset of the accept set).

 [skw38]Looks like a production rule rather than a set… probably need to go more slowly.

 [skw39]Surely, in practice, L1’ is in fact the earlier language, whose large accept set arises because of an open extensibility point which L1 at least partially closes through the definition of some extension.

 [skw40] I have yet to encounter any principles in the narrative so far.

 [skw41]What sort of compatible?

 [skw42]Ditto

 

 [skw43]Ditto

 

 [skw44]Partial understanding seems to be a technique or quality of a language design which makes some provision for arbitrary content.

 

“partially understanding a language is creating a language V1…” doesn’t feel like the right way to express it.

I like the use case which seems more to be about subseting a language but defining the subset such that it shares the accept set of the full language. Just the defined sets of the subset language are subsets of the defined sets of the full language.

 [skw45]This probably needs work. There is no defn of a “least partial language”…. And presumably the least amount of ‘understanding’ is none at all…

 [skw46]I suspect that this is broader than individual texts – it covers whole sets of texts that conform to the constraints of the language, however expressed, i.e. not just one text.

 [skw47]Which ones (internal reference?)

 [skw48]I fear that it may also be the case that not every agent has the same defined set – or at least if it does the relation to ‘information’ is different, e.g. different DOMs for the same defined text (though similar/identical visual presentation).