New issue on XML 1.1 support in XML Schema (and query, etc.)

I don't see a specific action assigned to me, but I believe I agreed on an
earlier conference call to send this note requesting that we open an issue 
relating to XML 1.1 [1] and its impact on our work.  By my reading of the 
Recommendation, XML 1.1 introduces at least the following changes:

* Documents may now be labeled <?xml version="1.1"?>.  Such a designation
MAY be used, but is discouraged, if the document could also have been
serialized as <?xml version="1.0"?>;  the new designation is required, of
course, when the new features described below are used.

* The set of name characters for element and attribute names has been 
expanded, and indeed is now open-ended:  XML 1.1 allows such names to 
include not just current Unicode characters, but others that may be 
assigned by the Unicode consortium in the future.  As I understand it, the 
distinction between the evolving flavors is not signaled in the XML 
declaration.  Version="1.1" allows any possible future characters, but 
only if the Unicode consortium has assigned them.

* The definition of "char" [2] has been changed to allow previously 
disallowed control characters in the range #x1 through #x1f.

* Some new line end characters have been introduced.  These are handled 
quite early in XML processing, and I don't >think< they cause schema much 
trouble because I don't think they're visible at the Infoset level where 
we work.
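
A minimal sketch of the change to the Char production, assuming the ranges
are as given in XML 1.0 (Second Edition) and in XML 1.1 [2].  The function
names are hypothetical and appear in neither Recommendation.  Note that XML
1.1 also defines RestrictedChar: those characters are legal, but only when
written as character references.

```python
# Sketch only: function names are illustrative, not from any spec.

def is_char_xml10(cp: int) -> bool:
    """Char per XML 1.0: #x9 | #xA | #xD | [#x20-#xD7FF]
    | [#xE000-#xFFFD] | [#x10000-#x10FFFF]."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def is_char_xml11(cp: int) -> bool:
    """Char per XML 1.1: [#x1-#xD7FF] | [#xE000-#xFFFD]
    | [#x10000-#x10FFFF]."""
    return (0x1 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def is_restricted_char_xml11(cp: int) -> bool:
    """RestrictedChar per XML 1.1: [#x1-#x8] | [#xB-#xC] | [#xE-#x1F]
    | [#x7F-#x84] | [#x86-#x9F]."""
    return (0x1 <= cp <= 0x8
            or 0xB <= cp <= 0xC
            or 0xE <= cp <= 0x1F
            or 0x7F <= cp <= 0x84
            or 0x86 <= cp <= 0x9F)


# Control characters below #x20 that XML 1.1 admits and XML 1.0 does not.
newly_allowed = [cp for cp in range(0x20)
                 if is_char_xml11(cp) and not is_char_xml10(cp)]
```

Under these definitions #x0 remains illegal in both versions, and every
entry in newly_allowed is also a RestrictedChar.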

I don't claim to have done a balanced or careful analysis of the 
implications for Schema, but the following occur to me as possible areas 
of concern:

* We use Infosets for instances and schemas.  There is a question as to 
how one knows whether the new names and content might appear in such an 
Infoset.  My impression is that it's implied that the switch is to be 
found in the [version] property of the document information item [3]. 
Concerns regarding the Infoset include:
-- While the version property is indeed in the Infoset rec, and the 2nd
edition talks about needing a processor that can handle whatever
serialized document you might have, I don't think it specifically ties the 
legal values of properties such as the [local name] of an element or legal 
[character codes] to this [version] property.  Synthetic Infosets, for 
example, need to be covered IMO.  For example, the newly published Infoset 
Rec says [4] "[character code] The ISO 10646 character code (in the range 
0 to #x10FFFF, though not every value in this range is a legal XML 
character code) of the character.", which seems a bit vague on what it 
means to be an XML character.
-- We in schemas define both schema "documents" and instances to be 
validated as element information items, with no reference to a required or 
containing document information item.   I think we need to consider 
whether the [version] property of the doc info item would meet our need to 
determine what version of XML we've got with respect to instances and 
(purported) schema documents.

* Our xsd:string type explicitly refers to the char production of XML 1.0
2nd edition.  Thus, it will not validate strings containing the control
characters of XML 1.1.  We could, I suppose, introduce a new type that 
would validate the new content, but there are complications, including:
-- xsd:string is the base for types like xsd:token, so we might have to create
parallel versions of some of those
-- If you wanted to write a schema document that had an enumeration or
fixed value constraint containing the new characters, then that schema 
document would have to be expressed as an XML 1.1 Infoset (see comment 
above regarding possible ambiguity about which Infosets are 1.1)
-- Our pattern language [5] is designed to constrain strings, but as I read
the spec it defines [6] "A normal character is any XML character that is 
not a metacharacter."   With the publication of XML 1.1 we see in 
hindsight that this is insufficiently precise.

* Since the range of legal element names has changed, we face questions 
regarding our ability to validate element and attribute content using the 
new names.
-- If your schema is written as a schema document, then presumably you can 
only enter the names if the document is an XML 1.1 Infoset (similar to 
concern raised for enumerations on strings)
-- Since the range is implicitly extensible as Unicode changes, it would 
seem that even a label of XML 1.1 on an infoset for a schema document does 
not ensure that it has the expressive power to name all the XML element 
and attribute names that one might wish to validate.  Some processor might
be checking the schema document with knowledge of, say, Unicode 4.0, but
the schema document might have been written with knowledge of a Unicode
5.0 that "assigned" new characters.
-- We have types such as xsd:Name [7] about which our Recommendation says
"[Definition:]   Name represents XML Names. The ·value space· of Name is 
the set of all strings which ·match· the Name production of [XML 1.0 
(Second Edition)]. The ·lexical space· of Name is the set of all strings 
which ·match· the Name production of [XML 1.0 (Second Edition)]. The ·base 
type· of Name is token. "  Note that xsd:token is derived from xsd:string, 
which is discussed above. 
-- We have an xsd:QName type, the definition of which is [8]
"[Definition:]   QName represents XML qualified names. The ·value space· 
of QName is the set of tuples {namespace name, local part}, where 
namespace name is an anyURI and local part is an NCName. The ·lexical 
space· of QName is the set of strings that ·match· the QName production of 
[Namespaces in XML]."  That link to [Namespaces in XML] is explicitly to 
[9]: "World Wide Web Consortium. Namespaces in XML. Available at: 
http://www.w3.org/TR/1999/REC-xml-names-19990114/", which is to the 1999 
Namespaces in XML recommendation. 
-- We use that QName type in the schema for schemas for the names of 
elements and attributes to be validated, as well as for references within 
schemas. 
-- Our component descriptions tend to have {name} properties that 
constrain their content by that same 1999 version of Namespaces.  See for 
example the element declaration schema component [10].  In general, there 
is a necessary tie between what we can put in these component properties, 
what we can express in a serialized schema document, what we can express 
in the corresponding schema document infoset, what's allowed by the 
xsd:QName type, and the names of elements and attributes we can validate.
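
For comparison, here is a minimal sketch of the XML 1.1 Name production,
assuming the NameStartChar and NameChar ranges are as listed in section 2.3
of [1].  The helper names are hypothetical.  The NCName check assumes the
usual definition of an NCName as a Name that contains no colon.

```python
# Sketch only: range tables transcribed from the XML 1.1 grammar;
# the function and table names are illustrative.

NAME_START_RANGES = [
    (0x3A, 0x3A),        # ":"
    (0x41, 0x5A),        # A-Z
    (0x5F, 0x5F),        # "_"
    (0x61, 0x7A),        # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Characters allowed after the first position, in addition to the above.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E),        # "-" and "."
    (0x30, 0x39),        # 0-9
    (0xB7, 0xB7),
    (0x300, 0x36F),
    (0x203F, 0x2040),
]


def _in_ranges(cp, ranges):
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_name_xml11(s: str) -> bool:
    """True if s matches Name ::= NameStartChar (NameChar)*."""
    if not s:
        return False
    if not _in_ranges(ord(s[0]), NAME_START_RANGES):
        return False
    return all(_in_ranges(ord(c), NAME_START_RANGES)
               or _in_ranges(ord(c), NAME_EXTRA_RANGES)
               for c in s[1:])


def is_ncname_xml11(s: str) -> bool:
    """True if s is a Name with no colon (assumed NCName definition)."""
    return is_name_xml11(s) and ":" not in s
```

For instance, is_name_xml11("\u1200\u1201") is true: those are Ethiopic
syllables, a script added to Unicode after version 2.0, on which the XML
1.0 name classes are based.  A type defined against the 1.0 production
would have to reject the same string.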

* Our type system is used by others such as query, both in the data model 
and as the type system for functions and operators.  As we wrestle with 
the definitions of types like xsd:string and xsd:Name, I presume that some
intensive liaison with them will be needed.  It's not implausible that, if
we introduce an xsd:stringv11 type, duplicate functions would be
needed for every F&O function that accepts or returns a string.  Likewise
for xsd:QName, etc.  Other groups such as XMLP and RDF also use our type
system and might be affected by changes or by lack of synergy with XML 1.0 
or XML 1.1.

* We talk about the representation of XML schema documents for retrieval 
on the web [11].  The pertinent part of the description of the web 
resource to be retrieved says [12]: "It resolves to (a fragment of) a 
resource which is an XML document (of type application/xml or text/xml 
with an XML declaration for preference, but this is not required), which 
in turn corresponds to a <schema> element information item in a 
well-formed information set, which in turn corresponds to a valid schema."
It seems we now need to be clearer as to if and when such documents may
have <?xml version="1.1"?>, what the rules are for cross-importing and 
including across versions, etc.  All of these must be related to whatever 
we decide above regarding rules for our components, types, enumeration 
constraints, etc.
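
As a rough illustration of where the version signal lives in a serialized
schema document, the following sketch reads the version from the XML
declaration.  The function name is hypothetical.  It assumes the text has
already been decoded, that any byte order mark has been stripped, and that
a document with no XML declaration is to be treated as XML 1.0.  It says
nothing about synthetic Infosets, which have no declaration to inspect.

```python
import re

# Matches the start of an XML declaration and captures the version value.
# \s is slightly more permissive than the S production; fine for a sketch.
_XML_DECL_VERSION = re.compile(r'^<\?xml\s+version\s*=\s*(["\'])(.*?)\1')


def declared_xml_version(text: str) -> str:
    """Return the version named in the XML declaration, or "1.0" if the
    text carries no declaration (an assumption of this sketch)."""
    m = _XML_DECL_VERSION.match(text)
    if m is None:
        return "1.0"
    return m.group(2)
```

A processor importing or including schema documents could compare
declared_xml_version() across the documents involved, though what it
should do when they differ is exactly the open question.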

Are we having fun yet?  I must say, I feel somewhat guilty for not having 
noticed these concerns when XML 1.1 was in last call.  I had heard 
anecdotally that it was just the line end stuff and bug fixes, and I 
confess it therefore didn't come up on my priority list for careful 
review.  Did we do a schema WG review, and do we know whether groups like 
Query did?  Or maybe I'm overestimating the complications, as I'm 
sometimes prone to do.

Comments?

Noah

 [1] http://www.w3.org/TR/2004/REC-xml11-20040204/
 [2] http://www.w3.org/TR/2004/REC-xml11-20040204/#NT-Char
 [3] http://www.w3.org/TR/2004/REC-xml-infoset-20040204/#infoitem.document
 [4] http://www.w3.org/TR/2004/REC-xml-infoset-20040204/#infoitem.character
 [5] http://www.w3.org/TR/xmlschema-2/#rf-pattern
 [6] http://www.w3.org/TR/xmlschema-2/#dt-normalc
 [7] http://www.w3.org/TR/xmlschema-2/#Name
 [8] http://www.w3.org/TR/xmlschema-2/#QName
 [9] http://www.w3.org/TR/xmlschema-2/#XMLNS
[10] http://www.w3.org/TR/xmlschema-1/#Element_Declaration_details
[11] http://www.w3.org/TR/xmlschema-1/#schema-repr
[12] http://www.w3.org/TR/xmlschema-1/#c-vxd

--------------------------------------
Noah Mendelsohn 
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------

Received on Thursday, 19 February 2004 18:55:28 UTC