[XMLVersioning-41] Comments and Suggestions on Draft Extensibility Finding

Background
----------

Dave Orchard is leading the TAG's effort on extensibility and versioning, 
and with help from co-editor Norm Walsh, Dave has been writing an 
extensive two-part draft finding.  Copies of a revised draft were posted 
to this list in November, just before the TAG's Cambridge F2F [1].  Few 
TAG members read the revisions in time for the meeting, but Dave 
did walk us through them.  Dan Connolly submitted some comments later [2], 
which generated a bit of discussion [3,4].

At the meeting, I indicated that I thought the drafts would benefit from 
more focus on framing the broader issues relating to versioning, XML and 
the Web, perhaps at the expense of some details relating to XML Schema 1.0 
and particular XML versioning idioms.  Such broader issues might include: 
1) how versioning and extensibility choices affect the utility and 
stability of XML-based Web technologies, 2) investigation of a somewhat 
broader range of XML use cases, and 3) deeper exploration of the general 
characteristics that we might want from any particular solutions.

The TAG assigned me an action to make more detailed suggestions, and to 
help Dave moving forward.  This note is in fulfillment of the first part 
of that assignment, i.e. to set out some of the directions I'd like to see 
explored.  I hope to work informally with Dave on whether and how to 
integrate these ideas.  I'm sure we'll have lots of opportunity to talk at 
the plenary.  I should say that overall I like a lot of what he and Norm 
have written, and I hope these will be viewed as constructive suggestions.

Overview of Comments, Suggestions, Concerns
-------------------------------------------

I. Pros and cons of extensibility

The "first rule" introduced in the draft is a Good Practice Note (GPN) 
that says [5]:  "Allow Extensibility rule: Languages SHOULD be designed 
for extensibility."  Other GPNs advocate specific idioms for doing this. 
In my opinion, this somewhat jumps to a conclusion regarding one of the 
most difficult and important tradeoffs relating to extensibility:  when do 
the benefits outweigh the costs?

I think it's fair to say that some of the most successful Web technologies 
have succeeded as much from the ways that they are inflexible as from the 
ways that they are extensible.  XML, which is arguably a success, had as 
one of its original goals: "The number of optional features in XML is to 
be kept to the absolute minimum, ideally zero."[6]. Except for the ability 
to define your own element and attribute names and choose character 
encodings, XML is remarkably inflexible and not particularly extensible. 
Sometimes that's frustrating:  we couldn't use XML Schema in place of DTDs 
in the internal subset, and it's proving very hard to roll out the new 
content conventions for XML 1.1.  Users rightly value the very high 
compatibility that results from XML's inflexibility.  Although the draft 
correctly cites HTML's open content and "must ignore" tag rules as a 
success, there have also been serious interoperability problems as various 
vendors exploited that flexibility to introduce their own flavors of HTML. 
 

I suspect that similar tradeoffs will apply as XML vocabularies are 
designed for other purposes:  extensibility tends to stand in opposition 
to interoperability, and both are important.  I think the finding would be 
much stronger if it explored such tradeoffs, and gave some more nuanced 
guidance as to when things should be locked down and when they should be 
extensible.  In fact, such analysis could be one of the essential 
contributions of the finding.  Yes, the answer is often to provide for 
certain forms of extensibility, but we shouldn't recommend that blindly. I 
think this is a subtle question that's particularly appropriate to the 
scope and mission of the TAG.

II. Relationship to namespaces

The recent semi-permathread on immutability of namespaces suggests that 
the community would welcome a lucid analysis of the relationship of 
namespaces to vocabularies, languages and to versioning of both.   Part 2 
of the drafts does discuss various strategies, but the permathread 
suggests that the community is looking for >principles< relating to the 
immutability or lack thereof of a namespace, principles relating the use 
of namespaces to the deployment of language versions and schemas, and 
perhaps principles explaining what role if any namespaces should play in 
determining how an application should interpret dialects of the 
vocabularies that it processes.

III. Dealing with partial understanding

The draft introduces definitions like "forwards-compatible" [7]: 

"A language change is forwards compatible if older processors can process 
all instances of the newer language."

It also suggests that [8]:

"Forwards compatibility can only be achieved by providing a substitution 
mechanism for Version 2 instances or Version 1 extensions to V1 without 
knowledge of V2.  A V1 consumer must be able to transform any instances, 
such as V1 + extensions, to a V1 instance in order to process the 
instance."

The finding would be stronger if it stepped up to the fact that processing 
is a matter of degree.  In an extensible system, it's common that even an 
early version of an application will have partial ability to process 
features introduced later.  Consider a new element introduced into a 
vocabulary.  Can it be completely ignored, i.e. safely eliminated by a 
substitution?  Well, I suspect that if there is a signature on the 
document then the new element is signed along with the others, even if not 
otherwise processed.  If you save the document on disk, do you not save 
the elements you didn't understand in detail?  Maybe; it depends why 
you're saving.  If you're a SOAP intermediary, do you relay the 
misunderstood elements?  SOAP gives you an attribute [9] that allows you 
to request such relay of content that was not otherwise understood, and 
SOAP specifically allows content from such elements to be used as input to 
other processing (e.g. digital signatures, logging, etc.).  If you have 
a function to print an XML document, do you print content from the new 
element?  Perhaps not, but you might also have default printing rules or 
heuristics that you could use.  The version 4 word processor mentioned in 
[7] may indeed successfully read version 5 documents, but may produce 
sub-optimal or incorrect output from some of them.  All of these are 
examples of systems in which partial understanding leads to useful 
processing.  Furthermore, if two different applications are deployed based 
on version 1 of a language, those applications may differ in their ability 
to deal with constructs that are introduced later.

I think the drafts jump a bit too quickly to proposals like "a 
substitution mechanism" and "mustIgnore", and thus obscure important 
issues relating to partial understanding.  Indeed, I'm not convinced that 
simple substitution mechanisms are the right framework for dealing with 
partial interoperation.
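
As a minimal sketch of the distinction, assume a hypothetical vocabulary 
in which version 1 knows only <name> and <price>, and a later version 
adds <discount>.  None of these names come from the drafts.  The first 
function models a "must ignore" substitution that reduces the instance to 
a V1 instance; the second keeps the elements that were not understood, so 
that they remain available for storing, relaying or signing.

```python
import xml.etree.ElementTree as ET

# Hypothetical version 1 vocabulary: the only children a V1 consumer knows.
V1_KNOWN = {"name", "price"}

# Hypothetical later instance: <discount> was introduced after V1 shipped.
V2_INSTANCE = (
    "<item><name>Widget</name><price>10</price>"
    "<discount>2</discount></item>"
)


def substitute_to_v1(root):
    """Must-ignore by substitution: build a copy without unknown children."""
    copy = ET.Element(root.tag, root.attrib)
    for child in root:
        if child.tag in V1_KNOWN:
            copy.append(child)
    return copy


def partition(root):
    """Partial understanding: separate understood children from the rest,
    keeping the rest so it can still be stored, relayed, or signed."""
    understood = [c for c in root if c.tag in V1_KNOWN]
    opaque = [c for c in root if c.tag not in V1_KNOWN]
    return understood, opaque


root = ET.fromstring(V2_INSTANCE)
v1_view = substitute_to_v1(root)
understood, opaque = partition(root)

print(ET.tostring(v1_view, encoding="unicode"))
print([c.tag for c in understood], [c.tag for c in opaque])
```

The substitution discards <discount> before any processing sees it, 
whereas the partition leaves each application free to decide what to do 
with content it only partly understands.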

By accurately modeling a more variable notion of compatibility, it also 
becomes possible to explore a question that the schema WG has been 
considering in detail:  how can a schema language help an application to 
sort out its different levels of understanding of particular content (e.g. 
what the application should store, what it should print, which content 
should be processed with what conventions)?  Various options have been 
suggested, including:  (a) because W3C XML schemas uniquely attribute each 
element in an instance to a particle in a schema content model, you can 
tell which elements were validated by wildcards -- that might suggest 
content you can tolerate but don't fully understand; (b) validate various 
subsets of the document (different substitutions) against multiple schemas 
or in various forms of fallback mode when content is not found to be fully 
valid.  The point is that, to explore such questions, you have to be very 
careful with assumptions about what it means for an application to 
"process" an instance, and how such assumptions relate to schema validity. 
 

Thus, I think the finding should more carefully deal with partial 
understanding of language constructs, and the relationship to schemas.
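
Option (a) above can be illustrated with a toy model.  This is a sketch 
under loud assumptions: the content model, the element names and the "*" 
wildcard marker are all hypothetical, ordering is ignored, and nothing 
here is real XML Schema validation.  It only shows the idea of attributing 
each element to the particle that accepted it, and then reading 
"accepted by a wildcard" as "tolerated but not fully understood".

```python
import xml.etree.ElementTree as ET

# Toy content model for a hypothetical <order> element.  Each particle is
# either a named element or the wildcard "*".  Order is not checked.
CONTENT_MODEL = ["id", "total", "*"]

INSTANCE = (
    "<order><id>7</id><total>30</total>"
    "<giftwrap>yes</giftwrap></order>"
)


def attribute_children(root, model):
    """Pair each child tag with the particle that accepted it."""
    named = {p for p in model if p != "*"}
    has_wildcard = "*" in model
    result = []
    for child in root:
        if child.tag in named:
            result.append((child.tag, child.tag))
        elif has_wildcard:
            result.append((child.tag, "*"))
        else:
            raise ValueError(f"unexpected element: {child.tag}")
    return result


attribution = attribute_children(ET.fromstring(INSTANCE), CONTENT_MODEL)

# Elements accepted only by the wildcard: tolerated, not fully understood.
tolerated = [tag for tag, particle in attribution if particle == "*"]
print(tolerated)
```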

IV. Need general guidelines for XML and Schema solutions

I think it's healthy to set up goals and success criteria separately from 
proposed solutions.  The draft does some of this, insofar as it makes the 
case that flexible extensibility is a goal.  I think there are some more 
detailed goals that should be set out or considered before getting into 
particular XML and Schema idioms.  Some that occurred to me are in the 
white paper I wrote last year [10,11], including:

* The same vocabulary may be versioned or fixed repeatedly.  Accordingly, 
any general approach should be convenient to use even after 20 or 30 such 
revisions.  Both instances and schemas of the later versions should be 
easy to create and use.

* The versioning mechanisms should (in most cases) not presume particular 
instance constructions such as <extension> elements.

* In some but not in all cases, some degree of forward and/or backward 
compatibility is required:  i.e. it should be possible but not 
essential to write early schemas that will somehow accept content that is 
not fully defined until later, and schemas for later versions will often 
but not always validate earlier forms of the vocabulary.  (The draft does 
cover this one, I think.)

* Conversely, breaking changes should not in all cases be forbidden.  For 
example, it may be that an early construct is deprecated at some later 
time, and perhaps completely disallowed eventually.  Likewise, later 
versions may introduce constructs that are rejected outright by earlier 
ones.

* It should be possible to check for or force various sorts of forward or 
backward compatibility when desired (this is the notion of partial 
recognition and processing, mentioned in III above).

* Schemas for versions of a vocabulary may but need not form a sequence or 
tree, in which later versions somehow directly reference particular schema 
documents for earlier versions.  This flexibility allows for possible 
redefinition of the same vocabulary by multiple organizations or in more 
than one schema (e.g. there's a debug schema and a production schema, 
neither based explicitly on the other). 

* A consequence of the point above is that the schema for version x is not 
necessarily expressed as a delta on or by direct reference to the schema 
for version x-1, if in fact the versions form a sequence at all.  Such 
incremental definition schemes are convenient, but do not necessarily 
scale to the case where the same vocabulary is revised 20 or 30 times.  In 
such a case one would need up to 30 schema documents to assemble the 
effective schema.   Thus, such incremental schemes should be allowed where 
useful, but not presumed in all cases.

* No unnecessary assumptions should be made regarding the relationships 
between vocabularies and XML Namespaces.   Often, a vocabulary will be 
expressed primarily as a single XML namespace.  Often, to maintain forward 
and backward compatibility, that same namespace will be used in subsequent 
versions as well.  Nothing in the overall XML mechanisms to support 
versioning (e.g. schema language constructs) should prohibit the use or 
coordinated evolution of multiple namespaces to define one or more 
languages, the addition of new namespaces in subsequent versions of a 
language, etc. (Here I admit I'm staking out a personal position on the 
Namespaces question raised in II above).

The above is NOT necessarily the right list, but I think the finding would 
make a contribution if it set out such principles separately from any 
proposed solutions.  If we do retain a Part 2 that discusses particular 
extensibility idioms, then they should each be rated against explicit 
goals such as the examples listed above.
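
The scaling concern in the incremental-definition bullet can be made 
concrete with a small sketch.  The delta format below is invented purely 
for illustration (a base version plus sets of added and removed element 
names); it is not real schema syntax.  The point is only that resolving 
the effective schema for version 30 means reading every document in the 
chain.

```python
# Hypothetical delta-style schemas: each version names its base version and
# the element names it adds or removes.  Not real schema syntax.
def make_deltas(n):
    deltas = {1: {"base": None, "add": {"e1"}, "remove": set()}}
    for v in range(2, n + 1):
        deltas[v] = {"base": v - 1, "add": {f"e{v}"}, "remove": set()}
    return deltas


def effective_schema(deltas, version):
    """Walk the chain back to the root, then apply deltas oldest first.

    Returns the set of allowed element names and the number of schema
    documents that had to be read to compute it.
    """
    chain = []
    v = version
    while v is not None:
        chain.append(v)
        v = deltas[v]["base"]
    allowed = set()
    for v in reversed(chain):
        allowed |= deltas[v]["add"]
        allowed -= deltas[v]["remove"]
    return allowed, len(chain)


allowed, documents_read = effective_schema(make_deltas(30), 30)
print(documents_read, len(allowed))
```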

V. The relationship between syntax and semantics

Though it mentions other options in passing, the finding deals primarily 
with examples in which the syntax of the XML more or less directly models 
the evolving semantics of the underlying data or application.  For 
example, a given parent element may allow for elements or attributes to be 
introduced to express features of the language as it evolves.  This is 
indeed a common idiom, and it's appropriate that the drafts explore it.

Nonetheless, such approaches do not cover the full spectrum of common 
mechanisms for versioning XML vocabularies.  Perhaps, as in SOAP encoding 
or RDF, the XML is a serialization for a higher level model, versioning of 
which is not well expressed at the element and attribute level.  We should 
go into more detail about the implications for XML and schemas, I think. 
Sometimes new versions of a language specify coordinated updates to the 
use of or constraints on the contents of elements or attributes scattered 
throughout a document.  Perhaps an attribute changes the meaning of a 
legacy element (e.g. currency="peso").  Perhaps the specification of a 
SOAP header requires that it be used with other headers (which may be 
interspersed with other headers).  In all these cases, it becomes 
difficult to tell the versioning story entirely in terms of XML elements 
and attributes, and it's often problematic to do a useful job of 
expressing the pertinent constraints in XML Schema languages.

In such systems, the extensibility of semantics is only indirectly related 
to the syntactic structure of the XML.  If the finding is to achieve its 
goal of exploring the versioning of XML vocabularies, then it's 
important either to deal with such approaches or to make the case that 
they are not important.  I think they will be common and are important. 
(BTW: I suspect that "mustIgnore" at the XML level does not cover such 
higher level versioning particularly well.)
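
The currency="peso" case above can be sketched as follows.  The <price> 
element, and the assumption that version 1 implied dollars, are 
hypothetical details added for illustration.  The version 1 reader 
ignores the attribute it does not know, which is exactly what a "must 
ignore" rule would tell it to do, and silently arrives at the wrong 
meaning.

```python
import xml.etree.ElementTree as ET

# Hypothetical instances.  Assumption: V1 defined <price> with an implied
# currency of dollars; a later version adds an optional currency attribute.
V1_DOC = "<price>100</price>"
V2_DOC = '<price currency="peso">100</price>'


def v1_read(xml_text):
    """V1 consumer: ignores attributes it does not know, assumes dollars."""
    el = ET.fromstring(xml_text)
    return (float(el.text), "dollar")


def v2_read(xml_text):
    """Later consumer: honors the attribute, defaulting to the V1 meaning."""
    el = ET.fromstring(xml_text)
    return (float(el.text), el.get("currency", "dollar"))


print(v1_read(V2_DOC))  # the amount is read, but in the wrong currency
print(v2_read(V2_DOC))
```

Both readers accept the instance without error, so the incompatibility 
is invisible at the level of element and attribute syntax.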

Summary
-------

Taken together, the above represent a proposal to focus the finding less 
on the details of particular XML constructions, and more on the general 
versioning and evolution strategies that are likely to be essential to the 
Web's and XML's continued success.  Indeed, there's some question as to 
whether the most useful finding would continue to focus only on XML, or 
also might introduce some general principles applicable to many media 
types, and then apply those to XML (or RDF, etc.) in particular.  I do 
recognize that issue XMLVersioning-41 [12] is currently scoped 
specifically to XML.

In general, following the precedent of the Architecture Document [13], we 
should explore high-level tradeoffs and principles, somewhat in preference 
to making detailed recommendations on syntactic mechanisms.  While there's 
lots of good work in the drafts on XML Schema specifics, especially in 
Part 2, I think those are only the purview of the TAG insofar as they are 
necessary to motivate the broader themes and principles, or are truly 
central to the Web's success. 

Other details of ensuring that W3C XML Schema is usable to support 
versioning scenarios are explicitly in the charter of the XML Schema WG 
[14];  indeed, I'm delighted that the TAG and Schema WG are now working 
more closely together.  I think the general balance should be that the 
Schema WG handles the schema-language-specific parts of the problem, with 
help from the TAG, and the TAG discusses the broader architectural issues, 
with help from (among others) the Schema WG. 

There remains a question of whether the TAG will choose to do a formal 
finding in this area at all.  I am cautiously optimistic that we can and 
should, but I do feel that our focus should be more on broader themes, 
perhaps including those discussed above.  I certainly think it's worth 
continued effort in the coming weeks to see whether we can do something 
that the community would value.

My recent rereading of the drafts has reminded me once again what a 
careful and diligent job Dave has done to take us to this point, and 
speaking for myself it is much appreciated!  This start will prove to be 
very valuable, regardless of how we proceed, or whether any of the 
suggestions made above are adopted.  I look forward to helping Dave and 
Norm in any way that I can to improve the drafts.

Thank you all for your patience with this long note.

Noah


[1] http://lists.w3.org/Archives/Public/www-tag/2004Nov/0071.html
[2] http://lists.w3.org/Archives/Public/www-tag/2005Jan/0018.html
[3] http://lists.w3.org/Archives/Public/www-tag/2005Jan/0019.html
[4] http://lists.w3.org/Archives/Public/www-tag/2005Jan/0020.html
[5] http://lists.w3.org/Archives/Public/www-tag/2004Nov/att-0071/versioning-part1.html#identify
[6] http://www.w3.org/TR/1998/REC-xml-19980210#sec-origin-goals
[7] http://lists.w3.org/Archives/Public/www-tag/2004Nov/att-0071/versioning-part1.html#terminology
[8] http://lists.w3.org/Archives/Public/www-tag/2004Nov/att-0071/versioning-part1.html#div250901096
[9] http://www.w3.org/TR/soap12-part1/#soaprelay
[10] http://lists.w3.org/Archives/Public/www-tag/2004Aug/0010.html
[11] http://lists.w3.org/Archives/Public/www-tag/2004Aug/att-0010/NRMVersioningProposal.html
[12] http://www.w3.org/2001/tag/issues.html?type=1#XMLVersioning-41
[13] http://www.w3.org/TR/webarch/
[14] http://www.w3.org/2003/09/xmlap/xml-schema-wg-charter.html#Deliverables


--------------------------------------
Noah Mendelsohn 
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------

Received on Sunday, 20 February 2005 17:55:09 UTC