Re: Versioning of XML Schema and namespaces from Eliot Kimber on 2005-05-11 (xmlschema-dev@w3.org from May 2005)

From: Eliot Kimber <ekimber@innodata-isogen.com>
Date: Wed, 11 May 2005 09:50:42 -0500
To: xmlschema-dev@w3.org
CC: Fraser Crichton <fraser.crichton@solnetsolutions.co.nz>, Dan Vint <dvint@dvint.com>, John.Hockaday@ga.gov.au
Message-ID: <42821BC2.1090807@innodata-isogen.com>
Fraser Crichton wrote:
> Hi,
> 
> I'm very interested in the reasons behind this -
> 
>  > Putting a version in the namespace is definitely not the right thing
>  > to do.
> 
> I ask because I've seen that as a possible approach to versioning 
> (http://www.xfront.com/Versioning.pdf) and it seems a number of 
> practitioners have adopted this e.g. the US Dept of Navy, xCIL, etc.

Per the W3C namespace spec, a namespace identifies an abstraction, an 
infinite set of names distinguished from all other possible names by 
having a unique prefix (the namespace URI).

Thus a namespace URI identifies an abstraction--there is no particular 
mechanism defined within the namespace spec for defining what names are 
actually in the namespace. That is, a namespace URI identifies an 
unbounded set of names, that is, an infinite set.

An infinite set cannot meaningfully be versioned because you cannot 
distinguish one version from another (because you can never enumerate 
all its members in order to prove equality or difference).

This is the philosophical reason for not versioning namespaces.

The practical reason derives from this idea of namespaces naming 
unversionable abstractions:

In practice, namespaces are bound to XML "applications" [I put 
"application" in quotes because it's not a precisely-defined term and to 
distinguish it from the narrow usage of _application_ to mean a specific 
software program.] For example, XSLT is an XML application, as are 
DocBook and XHTML. This binding is done in application specifications.

As an abstraction, the XSLT application is invariant over time: its 
basic purpose and usage will always be what it is now, regardless of the 
details of how it is implemented.

Thus, in this use case, namespace URIs represent the abstract idea of 
the application (that is, the concept of XSLT or DocBook or XHTML) and 
that abstract idea cannot be versioned and doesn't change over time.

That is, as long as the fundamental nature of a given application 
doesn't change, it would be inappropriate and unnecessary to change its 
namespace URI simply because some implementation detail of the 
application changed.

Or said another way, if you change the namespace URI, in any way, you 
are identifying a fundamentally *different* application.

Or, put yet another way, the namespace URI names *all current and future 
versions* of the concrete expressions of the application.

What *does* change are the concrete implementation artifacts that make 
up the application at any point in time. As concrete objects, they are 
versionable and will likely have different versions in time. Thus it is 
appropriate (in fact essential) that the resource locators for those 
concrete objects reflect the versions of them, otherwise you could only 
locate a single version of any one of them, which would be very limiting 
in most cases (for example, if I have two versions of the schema for a 
given application and documents that validate against one version or the 
other).

Thus, while the namespace URI for a given application should be 
invariant, the resource URLs for the concrete implementation components 
(schemas, transforms, java classes, documentation, etc.) will be variant 
as new versions are created. Of course, you might also offer URLs that 
represent the "latest" version--resources may have any number of URLs 
associated with them. But, in the general case, there should always be 
version-specific URLs for the resources.

How can this work in practice?

The best solution, in the abstract, I think, is what Mike suggests, 
namely an attribute that specifies the schema version, which the 
processor then uses to determine the correct schema instance to apply. 
This suggests that it might be useful for the XSD spec (or perhaps a 
separate, more general spec, since this requirement isn't XSD-specific) 
to define a "schema-version" attribute that can be used independently 
from the schemaLocation attribute.
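
To make the idea concrete, here is a minimal Python sketch of how a processor might use such an attribute. Everything named in it is an assumption for illustration: no spec defines a "schema-version" attribute today, and the example.org namespace, the registry, and the schema URLs are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry: (namespace URI, schema version) -> version-specific
# schema URL. The namespace URI stays the same across versions; only the
# resource URL changes.
SCHEMAS = {
    ("http://example.org/ns/metadata", "1.0"):
        "http://example.org/schemas/metadata/1.0/metadata.xsd",
    ("http://example.org/ns/metadata", "2.0"):
        "http://example.org/schemas/metadata/2.0/metadata.xsd",
}


def schema_for(document_text):
    """Return the schema URL for a document, or None if none is registered."""
    root = ET.fromstring(document_text)
    # ElementTree reports namespaced tags as "{namespace}localname".
    if root.tag.startswith("{"):
        namespace = root.tag[1:].split("}", 1)[0]
    else:
        namespace = ""
    # "schema-version" is an assumed, unnamespaced attribute on the root.
    version = root.get("schema-version")
    return SCHEMAS.get((namespace, version))
```

Two documents in the same namespace but with different schema-version values would then resolve to different schema instances without the namespace URI ever changing.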

But, given that current software (and certainly the Xerces processor, 
which provides schema-awareness in many tool chains) depends primarily 
on schemaLocation and/or catalogs, I think that a productive approach 
would be as described below.

John Hockaday writes:

> If I don't already have a copy of the
> XSDs referred to in the XML document instances then I need to download those
> XSDs and validate them.  
> 
> If the XSDs are not valid then I report my findings to my clients and reject
> the relevant XML document instances.  If the XSDs are valid then I validate
> the XML document instances against those XSDs and report my findings to my
> clients.  Again only valid XML document instances are accepted.

> If I do have a copy of the XSDs then I will have already validated them and I
> hope to use OASIS Catalogue files to refer to local copies of those XSDs when
> validating related XML document instances.  This will of course reduce
> bandwidth, time and costs and is essential when validating 40,000+ metadata
> records at a time.

Here there are two key and common requirements:

1. Validate documents against whatever schema they say they conform to 
(and, as a side effect, validate the schemas themselves).

2. Provide local copies of schemas to reduce processing time and network 
overhead.

John knows that there may be different versions of schemas for the same 
namespace.

I think the solution here is to use the catalogs as follows:

1. Require that incoming documents use absolute URIs for all 
schemaLocation specifications (not sure if this is currently the case in 
John's case).

2. Use the catalog to map these absolute URIs to the local copy of the 
schema (if there is one--if there's not one, fetch it and update the 
catalog).

3. As a fallback, map namespace URIs to schema URIs, where the 
appropriate schema for that namespace is known.
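
The three steps above can be sketched roughly as follows. This is only an illustration in Python: the two dictionaries stand in for real catalog entries (it does not use any actual OASIS catalog library), and all URIs and local paths are made up.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Stand-ins for catalog entries (hypothetical URIs and paths).
URI_TO_LOCAL = {
    "http://example.org/schemas/metadata/1.0/metadata.xsd":
        "/var/cache/schemas/metadata-1.0.xsd",
}
NAMESPACE_TO_SCHEMA = {
    "http://example.org/ns/metadata":
        "http://example.org/schemas/metadata/1.0/metadata.xsd",
}


def schema_location_pairs(root):
    """Read xsi:schemaLocation as a {namespace: schema URI} mapping."""
    tokens = root.get("{%s}schemaLocation" % XSI, "").split()
    return dict(zip(tokens[0::2], tokens[1::2]))


def resolve(namespace, declared):
    """Return (schema_uri, local_path); local_path is None if not cached."""
    uri = declared.get(namespace)
    if uri is not None and not urlparse(uri).scheme:
        # Step 1: only absolute schemaLocation URIs are accepted.
        raise ValueError("schemaLocation must be absolute: %r" % uri)
    if uri is None:
        # Step 3: fall back to the schema known for this namespace.
        uri = NAMESPACE_TO_SCHEMA.get(namespace)
    if uri is None:
        return None, None
    # Step 2: map the absolute URI to a local copy, if there is one.
    return uri, URI_TO_LOCAL.get(uri)
```

When resolve() returns a schema URI with no local path, the caller would fetch the schema, validate it, and add an entry to the catalog before validating the document.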

This does require that, when there are different schema versions for a 
given namespace, documents specify the correct schemaLocation value; 
otherwise John has no choice but to retrieve an arbitrary (presumably 
the latest) version of the schema for that namespace.

In the case where the version has been used in the namespace and there 
is no schemaLocation, the problem is the same: either there's exactly 
one schema for that namespace or John has to arbitrarily pick one.

This all puts the onus on document authors to specify correctly which 
version of a namespace's schema they want to use. There is no way around 
this--it's simply an unavoidable consequence of the fact that there can 
be different versions of a schema for a given namespace.

Note too, that this basic approach can be used to prevent authors from 
using schemaLocation= to nefarious ends where you have the requirement 
that documents conform only to a known, and controlled, set of schemas. 
Because you are remapping the schemaLocation URIs to local files, if 
authors specify a schemaLocation URI that you don't recognize (meaning 
that it's not mapped in the catalog), you can fall back to pointing to 
some local schema that will cause the document in question to fail its 
validation check. This is the functional equivalent of ignoring 
schemaLocation=.
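
A minimal sketch of that fallback, again in Python with invented names: KNOWN_SCHEMAS plays the role of the catalog, and REJECT_ALL is a hypothetical local schema written so that no incoming document can validate against it.

```python
# Hypothetical catalog of approved schemas: absolute URI -> local copy.
KNOWN_SCHEMAS = {
    "http://example.org/schemas/metadata/1.0/metadata.xsd":
        "/var/cache/schemas/metadata-1.0.xsd",
}

# Hypothetical local schema that deliberately matches no real document.
REJECT_ALL = "/var/cache/schemas/reject-all.xsd"


def local_schema_for(schema_location_uri):
    """Map a schemaLocation URI to a local file, never to the network."""
    # Unrecognized URIs are never fetched; they resolve to the schema that
    # forces a validation failure.
    return KNOWN_SCHEMAS.get(schema_location_uri, REJECT_ALL)
```

Because every lookup ends at a local file, an author-supplied URI is never dereferenced unless it is already in the controlled set.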

Cheers,

Eliot

-- 
W. Eliot Kimber
Professional Services
Innodata Isogen
9390 Research Blvd, #410
Austin, TX 78759
(512) 372-8155

ekimber@innodata-isogen.com
www.innodata-isogen.com
Received on Wednesday, 11 May 2005 14:50:02 UTC