[XMLVersioning-41] Comments and Suggestions on Draft Extensibility Finding from noah_mendelsohn@us.ibm.com on 2005-02-20 (www-tag@w3.org from February 2005)

From: <noah_mendelsohn@us.ibm.com>
Date: Sun, 20 Feb 2005 12:30:37 -0500
To: "David Orchard" <dorchard@bea.com>, www-tag@w3.org
Message-ID: <OFAF527DE1.9F11FBE0-ON85256FAB.007EB8D5@lotus.com>

Background
----------

Dave Orchard is leading the TAG's effort on extensibility and versioning,
and with help from co-editor Norm Walsh, Dave has been writing an
extensive two-part draft finding. Copies of a revised draft were posted
to this list in November, just before the TAG's Cambridge F2F [1]. Few
TAG members read the revisions in time for the meeting, but Dave did walk
us through them. Dan Connolly submitted some comments later [2]
which generated a bit of discussion [3,4].

At the meeting, I indicated that I thought the drafts would benefit from
more focus on framing the broader issues relating to versioning, XML and
the Web, perhaps at the expense of some details relating to XML Schema 1.0
and particular XML versioning idioms. Such broader issues might include:
1) how versioning and extensibility choices affect the utility and
stability of XML-based Web technologies, 2) investigation of a somewhat
broader range of XML use cases, and 3) deeper exploration of the general
characteristics that we might want from any particular solutions.

The TAG assigned me an action to make more detailed suggestions, and to
help Dave moving forward. This note is in fulfillment of the first part
of that assignment, i.e. to set out some of the directions I'd like to see
explored. I hope to work informally with Dave on whether and how to
integrate these ideas. I'm sure we'll have lots of opportunity to talk at
the plenary. I should say that overall I like a lot of what he and Norm
have written, and I hope these will be viewed as constructive suggestions.

Overview of Comments, Suggestions, Concerns
-------------------------------------------

I. Pros and cons of extensibility

The "first rule" introduced in the draft is a Good Practice Note (GPN)
that says [5]: "Allow Extensibility rule: Languages SHOULD be designed
for extensibility." Other GPNs advocate specific idioms for doing this.
In my opinion, this somewhat jumps to a conclusion regarding one of the
most difficult and important tradeoffs relating to extensibility: when do
the benefits outweigh the costs?

I think it's fair to say that some of the most successful Web technologies
have succeeded as much from the ways that they are inflexible as from the
ways that they are extensible. XML, which is arguably a success, had as
one of its original goals: "The number of optional features in XML is to
be kept to the absolute minimum, ideally zero." [6] Except for the ability
to define your own element and attribute names and choose character
encodings, XML is remarkably inflexible and not particularly extensible.
Sometimes that's frustrating: we couldn't use XML Schema in place of DTDs
in the internal subset, and it's proving very hard to roll out the new
content conventions for XML 1.1. Users rightly value the very high
compatibility that results from XML's inflexibility. Although the draft
correctly cites HTML's open content and "must ignore" tag rules as a
success, there have also been serious interoperability problems as various
vendors exploited that flexibility to introduce their own flavors of HTML.

I suspect that similar tradeoffs will apply as XML vocabularies are
designed for other purposes: extensibility tends to stand in opposition
to interoperability, and both are important. I think the finding would be
much stronger if it explored such tradeoffs, and gave some more nuanced
guidance as to when things should be locked down and when they should be
extensible. In fact, such analysis could be one of the essential
contributions of the finding. Yes, the answer is often to provide for
certain forms of extensibility, but we shouldn't recommend that blindly. I
think this is a subtle question that's particularly appropriate to the
scope and mission of the TAG.

II. Relationship to namespaces

The recent semi-permathread on immutability of namespaces suggests that
the community would welcome a lucid analysis of the relationship of
namespaces to vocabularies, languages and to versioning of both. Part 2
of the drafts does discuss various strategies, but the permathread
suggests that the community is looking for >principles< relating to the
immutability or lack thereof of a namespace, principles relating the use
of namespaces to the deployment of language versions and schemas, and
perhaps principles explaining what role if any namespaces should play in
determining how an application should interpret dialects of the
vocabularies that it processes.

III. Dealing with partial understanding

The draft introduces definitions like "forwards-compatible" [7]:

"A language change is forwards compatible if older processors can process
all instances of the newer language."

It also suggests that [8]:

"Forwards compatibility can only be achieved by providing a substitution
mechanism for Version 2 instances or Version 1 extensions to V1 without
knowledge of V2. A V1 consumer must be able to transform any instances,
such as V1 + extensions, to a V1 instance in order to process the
instance."

The finding would be stronger if it stepped up to the fact that processing
is a matter of degree. In an extensible system, it's common that even an
early version of an application will have partial ability to process
features introduced later. Consider a new element introduced into a
vocabulary. Can it be completely ignored, i.e. safely eliminated by a
substitution? Well, I suspect that if there is a signature on the
document then the new element is signed along with the others, even if not
otherwise processed. If you save the document on disk, do you not save
the elements you didn't understand in detail? Maybe; it depends why
you're saving. If you're a SOAP intermediary, do you relay the
misunderstood elements? SOAP gives you an attribute [9] that allows you
to request such relay of content that was not otherwise understood, and
SOAP specifically allows content from such elements to be used as input to
other processing (e.g. digital signatures, logging, etc.). If you have a
function to print an XML document, do you print content from the new
element? Perhaps not, but you might also have default printing rules or
heuristics that you could use. The version 4 word processor mentioned in
[7] may indeed successfully read version 5 documents, but may produce
sub-optimal or incorrect output from some of them. All of these are
examples of systems in which partial understanding leads to useful
processing. Furthermore, if two different applications are deployed based
on version 1 of a language, those applications may differ in their ability
to deal with constructs that are introduced later.
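
To make the SOAP case concrete, here is a minimal sketch of the forwarding
decision an intermediary might make for header blocks. It is a
simplification under stated assumptions: it only looks at the "next" role
and the relay attribute from [9], it ignores mustUnderstand faults and
custom roles, and the function name, the x: header blocks and the
"understood" set are hypothetical.

```python
import xml.etree.ElementTree as ET

ENV = "http://www.w3.org/2003/05/soap-envelope"   # SOAP 1.2 envelope namespace
NEXT = ENV + "/role/next"                         # role played by every intermediary


def forward_headers(envelope_xml, understood):
    """Return the tags of header blocks a forwarding intermediary passes on.

    `understood` is a hypothetical set of qualified names this node can process.
    """
    root = ET.fromstring(envelope_xml)
    header = root.find(f"{{{ENV}}}Header")
    blocks = list(header) if header is not None else []
    forwarded = []
    for block in blocks:
        targeted = block.get(f"{{{ENV}}}role") == NEXT
        relay = block.get(f"{{{ENV}}}relay") in ("true", "1")
        if not targeted:
            forwarded.append(block.tag)   # not addressed to this node: pass through
        elif block.tag in understood:
            continue                      # processed here, so consumed
        elif relay:
            forwarded.append(block.tag)   # not understood, but relay was requested
        # otherwise: targeted, not understood, no relay, so the block is dropped
    return forwarded


# Hypothetical message with three extension header blocks.
MESSAGE = f"""<env:Envelope xmlns:env="{ENV}" xmlns:x="urn:example:ext">
  <env:Header>
    <x:audit env:role="{NEXT}" env:relay="true">a</x:audit>
    <x:hint env:role="{NEXT}">b</x:hint>
    <x:final>c</x:final>
  </env:Header>
  <env:Body/>
</env:Envelope>"""

print(forward_headers(MESSAGE, understood=set()))
# ['{urn:example:ext}audit', '{urn:example:ext}final']
```

The node understands none of the three blocks, yet it still does something
useful and different with each of them, which is the sense of "partial
understanding" intended here.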

I think the drafts jump a bit too quickly to proposals like "a
substitution mechanism" and "mustIgnore", and thus obscure important
issues relating to partial understanding. Indeed, I'm not convinced that
simple substitution mechanisms are the right framework for dealing with
partial interoperation.
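
As an illustration of what a pure substitution can lose, the following
sketch projects a newer instance down to a version 1 vocabulary and then
compares a digest before and after. Everything in it is assumed for the
example: the order vocabulary, the V1_NAMES set, and the use of a SHA-256
hash over the serialized bytes as a stand-in for a real XML signature
(which would involve canonicalization).

```python
import hashlib
import xml.etree.ElementTree as ET

V1_NAMES = {"order", "item", "quantity"}   # hypothetical V1 vocabulary


def project_to_v1(elem):
    """Must-ignore substitution: drop any subtree whose element name V1 does not define."""
    for child in list(elem):
        if child.tag in V1_NAMES:
            project_to_v1(child)
        else:
            elem.remove(child)
    return elem


def digest(elem):
    # Stand-in for a signature: a hash of the serialized bytes, not real XML canonicalization.
    return hashlib.sha256(ET.tostring(elem)).hexdigest()


doc = ET.fromstring(
    "<order><item>widget</item><quantity>2</quantity><giftwrap>yes</giftwrap></order>"
)
signed = digest(doc)                                          # sender signed the whole V2 instance
projected = project_to_v1(ET.fromstring(ET.tostring(doc)))   # project a copy, keep the original

print([child.tag for child in projected])   # ['item', 'quantity']
print(digest(projected) == signed)          # False
print(digest(doc) == signed)                # True
```

The projected copy is what a version 1 consumer can interpret, but only the
untouched original can be checked against the digest, saved faithfully, or
relayed. A consumer therefore needs both views, not just the substituted
one.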

By accurately modeling a more variable notion of compatibility, it also
becomes possible to explore a question that the schema WG has been
considering in detail: how can a schema language help an application to
sort out its different levels of understanding of particular content (e.g.
what the application should store, what it should print, which content
should be processed with what conventions)? Various options have been
suggested, including: (a) because W3C XML schemas uniquely attribute each
element in an instance to a particle in a schema content model, you can
tell which elements were validated by wildcards -- that might suggest
content you can tolerate but don't fully understand; (b) validate various
subsets of the document (different substitutions) against multiple schemas
or in various forms of fallback mode when content is not found to be fully
valid. The point is that, to explore such questions, you have to be very
careful with assumptions about what it means for an application to
"process" an instance, and how such assumptions relate to schema validity.
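
Option (a) can be sketched roughly as follows. This is not a real schema
validator: the DECLARED tuple is a hand-written stand-in for the element
particles of a hypothetical version 1 content model, anything else is
treated as having matched a wildcard, and the POLICY table is an invented
example of how an application might map operations onto that attribution.

```python
import xml.etree.ElementTree as ET

DECLARED = ("name", "email")   # hypothetical element particles in the V1 content model


def attribute(parent):
    """Pair each child with the particle that accepted it: its own name, or 'wildcard'."""
    return [(child, child.tag if child.tag in DECLARED else "wildcard") for child in parent]


# Hypothetical per-operation policy: which particles each operation acts on.
POLICY = {
    "process": lambda particle: particle != "wildcard",
    "print": lambda particle: particle != "wildcard",
    "store": lambda particle: True,   # keep everything, understood or not
}


def select(parent, operation):
    """List the child element names that the given operation would act on."""
    keep = POLICY[operation]
    return [child.tag for child, particle in attribute(parent) if keep(particle)]


person = ET.fromstring(
    "<person><name>Ann</name><email>a@example.org</email><pager>555</pager></person>"
)
print(select(person, "process"))   # ['name', 'email']
print(select(person, "store"))     # ['name', 'email', 'pager']
```

The point of the sketch is only that "process" is not one thing: the same
attribution yields different answers for different operations.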

Thus, I think the finding should more carefully deal with partial
understanding of language constructs, and the relationship to schemas.

IV. Need general guidelines for XML and Schema solutions

I think it's healthy to set up goals and success criteria separately from
proposed solutions. The draft does some of this, insofar as it makes the
case that flexible extensibility is a goal. I think there are some more
detailed goals that should be set out or considered before getting into
particular XML and Schema idioms. Some that occurred to me are in the
white paper I wrote last year [10,11], including:

* The same vocabulary may be versioned or fixed repeatedly. Accordingly,
any general approach should be convenient to use even after 20 or 30 such
revisions. Both instances and schemas of the later versions should be
easy to create and use.

* The versioning mechanisms should (in most cases) not presume particular
instance constructions such as <extension> elements.

* In some but not in all cases, some degree of forward and/or backward
compatibility is required, i.e. it should be possible but not
essential to write early schemas that will somehow accept content that is
not fully defined until later, and schemas for later versions will often
but not always validate earlier forms of the vocabulary. (The draft does
cover this one, I think.)

* Conversely, breaking changes should not in all cases be forbidden. For
example, it may be that an early construct is deprecated at some later
time, and perhaps completely disallowed eventually. Likewise, later
versions may introduce constructs that are rejected outright by earlier
ones.

* It should be possible to check for or force various sorts of forward or
backward compatibility when desired (this is the notion of partial
recognition and processing, mentioned in III above).

* Schemas for versions of a vocabulary may but need not form a sequence or
tree, in which later versions somehow directly reference particular schema
documents for earlier versions. This flexibility allows for possible
redefinition of the same vocabulary by multiple organizations or in more
than one schema (e.g. there's a debug schema and a production schema,
neither based explicitly on the other).

* A consequence of the point above is that the schema for version x is not
necessarily expressed as a delta on or by direct reference to the schema
for version x-1, if in fact the versions form a sequence at all. Such
incremental definition schemes are convenient, but do not necessarily
scale to the case where the same vocabulary is revised 20 or 30 times. In
such a case one would need up to 30 schema documents to assemble the
effective schema. Thus, such incremental schemes should be allowed where
useful, but not presumed in all cases.

* No unnecessary assumptions should be made regarding the relationships
between vocabularies and XML Namespaces. Often, a vocabulary will be
expressed primarily as a single XML namespace. Often, to maintain forward
and backward compatibility, that same namespace will be used in subsequent
versions as well. Nothing in the overall XML mechanisms to support
versioning (e.g. schema language constructs) should prohibit the use or
coordinated evolution of multiple namespaces to define one or more
languages, the addition of new namespaces in subsequent versions of a
language, etc. (Here I admit I'm staking out a personal position on the
Namespaces question raised in II above).

The above is NOT necessarily the right list, but I think the finding would
make a contribution if it set out such principles separately from any
proposed solutions. If we do retain a Part 2 that discusses particular
extensibility idioms, then they should each be rated against explicit
goals such as the examples listed above.
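
To illustrate the scaling concern in the bullet on incremental definitions,
here is a toy model in which a "schema" is reduced to the set of element
names it allows and each version is published only as a delta. The delta
format, the element names and the 30-version history are all invented for
the example.

```python
def effective_schema(deltas):
    """Fold a chain of version deltas into the set of element names allowed at the end."""
    allowed = set()
    for delta in deltas:
        allowed |= delta.get("add", set())
        allowed -= delta.get("remove", set())
    return allowed


# Hypothetical history: version 1 defines two elements, each later version adds one,
# and version 10 retires an early construct.
deltas = [{"add": {"name", "email"}}]
deltas += [{"add": {f"ext{version}"}} for version in range(2, 31)]
deltas[9]["remove"] = {"email"}

schema_v30 = effective_schema(deltas)
print(len(deltas))             # 30
print(len(schema_v30))         # 30
print("email" in schema_v30)   # False
```

Answering even a simple question about version 30, such as whether email
is still allowed, requires reading every delta back to version 1. A
self-contained schema for version 30 would answer it from one document.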

V. The relationship between syntax and semantics

Though it mentions other options in passing, the finding deals primarily
with examples in which the syntax of the XML more or less directly models
the evolving semantics of the underlying data or application. For
example, a given parent element may allow for elements or attributes to be
introduced to express features of the language as it evolves. This is
indeed a common idiom, and it's appropriate that the drafts explore it.

Nonetheless, such approaches do not cover the full spectrum of common
mechanisms for versioning XML vocabularies. Perhaps, as in SOAP encoding
or RDF, the XML is a serialization for a higher level model, versioning of
which is not well expressed at the element and attribute level. We should
go into more detail about the implications for XML and schemas, I think.
Sometimes new versions of a language specify coordinated updates to the
use of or constraints on the contents of elements or attributes scattered
throughout a document. Perhaps an attribute changes the meaning of a
legacy element (e.g. currency="peso"). Perhaps the specification of a
SOAP header requires that it be used with other headers (which may be
interspersed with other headers). In all these cases, it becomes
difficult to tell the versioning story entirely in terms of XML elements
and attributes, and it's often problematic to do a useful job of
expressing the pertinent constraints in XML Schema languages.
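
The currency example can be made concrete with a small sketch. The price
element, the two reader functions, and the assumption that version 1 always
means US dollars are hypothetical; only the currency="peso" attribute comes
from the discussion above.

```python
import xml.etree.ElementTree as ET


def price_v1(elem):
    """V1 reading: <price> is assumed to be US dollars; unknown attributes are ignored."""
    return float(elem.text), "USD"


def price_v2(elem):
    """V2 reading: an optional currency attribute changes what the number means."""
    return float(elem.text), elem.get("currency", "USD")


new_instance = ET.fromstring('<price currency="peso">100</price>')
print(price_v1(new_instance))   # (100.0, 'USD')
print(price_v2(new_instance))   # (100.0, 'peso')
```

The version 1 reader accepts the instance without complaint and silently
gets the wrong answer. Ignoring the attribute it did not understand is
syntactically safe but semantically a breaking change.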

In such systems, the extensibility of semantics is only indirectly related
to the syntactic structure of the XML. If the finding is to achieve its
goal of exploring the versioning of XML vocabularies, then it's just as
important either to deal with such approaches or to make the case that
they are not important. I think they will be common and are important.
(BTW: I suspect that "mustIgnore" at the XML level does not cover such
higher level versioning particularly well.)

Summary
-------

Taken together, the above represent a proposal to focus the finding less
on the details of particular XML constructions, and more on the general
versioning and evolution strategies that are likely to be essential to the
Web's and XML's continued success. Indeed, there's some question as to
whether the most useful finding would continue to focus only on XML, or
also might introduce some general principles applicable to many media
types, and then apply those to XML (or RDF, etc.) in particular. I do
recognize that issue XMLVersioning-41 [12] is currently scoped
specifically to XML.

In general, following the precedent of the Architecture Document [13], we
should explore high-level tradeoffs and principles, somewhat in preference
to making detailed recommendations on syntactic mechanisms. While there's
lots of good work in the drafts on XML Schema specifics, especially in
Part 2, I think those are only the purview of the TAG insofar as they are
necessary to motivate the broader themes and principles, or are truly
central to the Web's success.

Other details of ensuring that W3C XML Schema is usable to support
versioning scenarios are explicitly in the charter of the XML Schema WG
[14]; indeed, I'm delighted that the TAG and Schema WG are now working
more closely together. I think the general balance should be that the
Schema WG handles the schema-language-specific parts of the problem, with
help from the TAG, and the TAG discusses the broader architectural issues,
with help from (among others) the Schema WG.

There remains a question of whether the TAG will choose to do a formal
finding in this area at all. I am cautiously optimistic that we can and
should, but I do feel that our focus should be more on broader themes,
perhaps including those discussed above. I certainly think it's worth
continued effort in the coming weeks to see whether we can do something
that the community would value.

My recent rereading of the drafts has reminded me once again what a
careful and diligent job Dave has done to take us to this point, and
speaking for myself it is much appreciated! This start will prove to be
very valuable, regardless of how we proceed, or whether any of the
suggestions made above are adopted. I look forward to helping Dave and
Norm in any way that I can to improve the drafts.

Thank you all for your patience with this long note.

Noah

[1] http://lists.w3.org/Archives/Public/www-tag/2004Nov/0071.html
[2] http://lists.w3.org/Archives/Public/www-tag/2005Jan/0018.html
[3] http://lists.w3.org/Archives/Public/www-tag/2005Jan/0019.html
[4] http://lists.w3.org/Archives/Public/www-tag/2005Jan/0020.html
[5] http://lists.w3.org/Archives/Public/www-tag/2004Nov/att-0071/versioning-part1.html#identify
[6] http://www.w3.org/TR/1998/REC-xml-19980210#sec-origin-goals
[7] http://lists.w3.org/Archives/Public/www-tag/2004Nov/att-0071/versioning-part1.html#terminology
[8] http://lists.w3.org/Archives/Public/www-tag/2004Nov/att-0071/versioning-part1.html#div250901096
[9] http://www.w3.org/TR/soap12-part1/#soaprelay
[10] http://lists.w3.org/Archives/Public/www-tag/2004Aug/0010.html
[11] http://lists.w3.org/Archives/Public/www-tag/2004Aug/att-0010/NRMVersioningProposal.html
[12] http://www.w3.org/2001/tag/issues.html?type=1#XMLVersioning-41
[13] http://www.w3.org/TR/webarch/
[14] http://www.w3.org/2003/09/xmlap/xml-schema-wg-charter.html#Deliverables

--------------------------------------
Noah Mendelsohn
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------

Received on Sunday, 20 February 2005 17:55:09 UTC