anyURI and references to external specifications - summary of XSD 1.1 design rationale and technical arguments

In the context of bug 6089

    http://www.w3.org/Bugs/Public/show_bug.cgi?id=6089

Murata Makoto has suggested that

    xsd:anyURI of 1.1 should allow LEIRIs of W3C (and IETF) and
    nothing else.  

In the course of trying to figure out how to move forward on this
issue, I have just reviewed the record of the XML Schema WG's
discussions of this and related issues, specifically the issues
originally opened as wd-25, wd-28, and wd-29 in the 1.1 issues list
at

    http://www.w3.org/XML/2004/07/xs11-pre-lc-issues/

These were later transferred into Bugzilla as 

    2751 wd-25: anyURI, RFCs 2396 and 3896
    2754 wd-28: Proposal from the i18n-core wg for changes of anyURI
    2755 wd-29: URI changes in RFC 3986

The rationale for the WG's decisions emerges tolerably well, I
think, from the minutes of the meetings of May 2005 in Morrisville,
North Carolina, and of August 2005 in San Mateo, California:

    http://lists.w3.org/Archives/Member/w3c-xml-schema-wg/2005May/att-0024/Minutes_of_the_W3C_XML_Schema_Working_Group_5th__38th__F2F_meeting.htm

    http://lists.w3.org/Archives/Member/w3c-xml-schema-ig/2005Aug/0020.html

Both sets of minutes are member-accessible.  For the benefit of
others, I summarize some of the technical arguments brought forward
in those meetings.

- The datatypes xsd:language and xsd:anyURI both depend on and refer
to external specifications, and in both cases the external
specifications referred to by XSD 1.0 have been made obsolete by new
specifications which specify different syntax for the construct. It
would be desirable to resolve the attendant problems in the same way
for both datatypes.

In the case of xsd:language, XSD 1.1 specifies a simple regular language
which is a superset of the languages specified by the old and new
specifications for language codes, and type validity in XSD 1.1 is
explicitly not sufficient to guarantee conformance to the external
specification. Some members of the WG wanted a similar solution
here: to specify a simple rule which is a superset of the rules of
the various forms of the various external specifications.
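To make the distinction between type validity and conformance concrete,
here is a minimal sketch in Python. It assumes the pattern facet given for
xsd:language in XSD 1.1 Part 2 is [a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*; the
function name is hypothetical and the sketch is not taken from any schema
processor.

```python
import re

# Assumed pattern facet for xsd:language (XSD 1.1 Part 2: Datatypes).
XSD_LANGUAGE = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")


def is_type_valid_language(value: str) -> bool:
    """Return True if value matches the xsd:language pattern.

    This checks the lexical form only. It does not consult any
    registry of language subtags, so a True result says nothing
    about conformance to the external language-tag specification.
    """
    return XSD_LANGUAGE.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["en", "en-US", "i-klingon", "qqq-nosuch-tag",
                      "abcdefghi", "en_US", ""]:
        print(repr(candidate), is_type_valid_language(candidate))
```

A string such as "qqq-nosuch-tag" passes this check even though nothing
guarantees that it is a registered or well-formed language tag, which is
the sense in which type validity is not sufficient for conformance.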

- There were reliable reports that some URIs accepted by RFC 2396
were not accepted by RFC 3986.

- Some members of the WG were and are strongly opposed to any change
which would change any document from valid under XSD 1.0 to invalid
in XSD 1.1. These WG members are also concerned about changes that
loosen the type validity rules and thus move documents from invalid
to valid, but such changes are felt (I believe) to be less damaging
and are not opposed so fiercely.

- It was believed at the time by some WG members (including me) that
the language defined by the RFCs for URIs and IRIs was not regular
and thus could not be reduced to a regular expression, so that the
easiest way to specify a superset would be to allow any string as a
type-valid form of anyURI. [Subsequent work has shown that this
belief was wrong: the language of RFC 3986 is regular.]

- Some WG members argued that for realistic validation of URIs, the
RFCs for generic URI syntax are insufficient, because they do not
cover any of the scheme-specific rules of syntax.  They concluded
that requiring conformance to the RFC as a condition of
type-validity was not actually very helpful.

- Some members of the WG felt that XSD 1.0 did not in fact impose
tight constraints on anyURI values (at least, not effectively).  It
does say that

    The ·lexical space· of anyURI is finite-length character
    sequences which, when the algorithm defined in Section 5.4 of
    [XML Linking Language] is applied to them, result in strings
    which are legal URIs according to [RFC 2396], as amended by [RFC
    2732].

but "legal URI" is not a term defined by RFC 2396 or RFC 2732, and
it is not clear at first glance to readers of those specifications
whether they actually intend to define a clearly bounded class of
conforming strings or not, and if so just what it is.  The fact
that multiple grammars are given, which accept different strings,
may be part of the difficulty here.

Some experienced Web programmers have claimed that really the only
strings forbidden by RFC 2396 are strings with more than one #
character in them.  (This also turns out to be not quite true: if
the RFC prohibits anything, it also prohibits strings with the
various prohibited characters.)

On this view, the statement in XSD 1.1 that any string is type-valid
as an instance of anyURI is not so much a liberalization as a coming
clean about the state of affairs.  Some WG members were explicit
that they believed the intent of 1.0 (whether successfully expressed
or not) had been to have a type which allowed pretty much any
string.
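For concreteness, the claim reported above (at most one # character, plus
the correction about prohibited characters) can be sketched as a loose
check in Python. This is an approximation under stated assumptions, not a
validator for the RFC grammar: the excluded set is my reading of RFC 2396
section 2.4.3 with the square brackets re-admitted by RFC 2732, the
function name is hypothetical, and the check is meant for the string as it
stands after any escaping has been applied.

```python
import re

# Characters excluded by RFC 2396 section 2.4.3 (as read here), minus
# '#' and '%', which are handled separately below, and minus '[' and
# ']', which RFC 2732 re-admits for IPv6 literals.
EXCLUDED = set('<>"{}|\\^` ') | {chr(c) for c in range(0x20)} | {"\x7f"}

# A '%' that is not followed by two hex digits is not a valid escape.
BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def passes_loose_rfc2396_check(s: str) -> bool:
    """Loose plausibility check; passing does NOT imply RFC conformance."""
    if s.count("#") > 1:
        # More than one fragment delimiter.
        return False
    if any(ch in EXCLUDED or ord(ch) > 0x7F for ch in s):
        # Excluded or non-ASCII character left unescaped.
        return False
    return BAD_ESCAPE.search(s) is None
```

Very few strings fail such a check once the escaping step has been
applied, which is consistent with the view that 1.0 never constrained the
type tightly in practice.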

- On the dissenting side, some implementors noted that their
implementations did check the rules of RFC 2396 and they had
comments from users suggesting that some users at least do use the
type in the expectation that it will enforce the RFC rules.

- The empirical data available on the behavior of existing XSD 1.0
processors suggested that the existing processors were not consistent
in the rules they checked for URI values.

- Some WG members argued that XSD 1.0 had made a mistake in coupling
the specification of URIs tightly to a specific version of a
specific external specification (RFC 2396); users are better served
by a loose coupling between specifications. Just as HTML validity
does not depend on conformance to a particular version of the URI
specification, so schema-validity should not either. Loose coupling
allows specs to remain stable even as the external specs they refer
to are revised.

In international standards, normative references are often (not
always) accompanied by text which says, roughly

    The following standards contain provisions which, through
    reference in this text, constitute provisions of [this
    specification]. At the time of publication, the editions indicated
    were valid. All standards are subject to revision, and parties to
    agreements based on [this specification] are encouraged to
    investigate the possibility of applying the most recent editions
    of the standards listed below.

Some readers (including me) take this as an indication that
conforming implementations of those ISO specifications are allowed
to support the current version of the other specifications referred
to, without losing their claim to conformance.  Some WG members
(including me) had believed that this was such a self-evidently
necessary rule that no one could read any W3C specification
(specifically including XSD 1.0) as forbidding implementations from
supporting (for example) later versions of the URI spec, or the
language-code spec, or Unicode, or XML, than those listed in the
references.  Experience has taught differently. Some readers,
including some members of the XML Schema WG, read XSD 1.0 as
requiring the use of specific versions of external specifications,
and not allowing upgrades to later versions.  Others may (and do)
believe that those readers are wrong, but it is clear that they
exist.

It's empirically observable that the definitions of URI and IRI have
changed as older versions of the specifications have been replaced
by newer ones.  Some members of the WG felt, when we made this
decision, that it would be better NOT to try to track the details of
external specifications; users (it was argued) would be better
served by being able to use the current version of the IRI spec,
rather than being trapped by their schema processor with an outdated
version of the spec.  Murata-san's original comment illustrates
concisely the problem faced by users when XSD is tightly coupled to
external specifications.

So much for the technical arguments advanced at the time.  Reviewing
the decision record has persuaded me that the arguments for loose
coupling are good ones. I think now that XSD would perhaps do better
to encourage, or even require, implementations to enforce the rules
of *some* implementation-specified version of the relevant RFCs, but
it's clear from the minutes that a proposal to require support for
an implementation-specified RFC would never have gotten anywhere; a
proposal to encourage it without requiring it was in fact made, and
got nowhere.

I hope this helps clarify the design rationale for the current state
of affairs in XSD 1.1 both with regard to IRIs and with regard to
language codes.

CMSMcQ


-- 
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com 
* http://cmsmcq.com/mib                 
* http://balisage.net
****************************************************************

Received on Friday, 17 December 2010 01:45:54 UTC