[DM] Untyped data (xs:anyType, xs:anySimpleType, xdt:untypedAny, xdt:untypedAtomic)

Untyped data is one of the significant challenges in the design of the
XML Query type system. Two important criteria for the representation
of untyped data are:

1. We need a way to identify data that is not schema processed.

It should be easy for a processor to identify documents or regions for
which no schema processing is done, either because the instance was
not schema validated or because the nodes lie in a skip-validated
region of a schema. This allows a processor to know that no typed data
occurs within the region. One way to do this is to use xdt:untyped for
elements that have not been schema processed, and xdt:untypedAtomic
for attributes that have not been schema processed, and to use the
types assigned by XML Schema, including xs:anyType and
xs:anySimpleType, when schema processing has been done.
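
As a rough sketch of the rule just described (Python is used only for
illustration; the function name, the node-kind strings, and the
schema_type parameter are hypothetical and not taken from any
specification):

```python
from typing import Optional


def type_annotation(node_kind: str, schema_type: Optional[str]) -> str:
    """Return the type name for a node under the rule sketched above.

    node_kind   -- "element" or "attribute" (hypothetical labels)
    schema_type -- the type name assigned by schema processing, or None
                   if the node was not schema processed (unvalidated
                   instance or skip-validated region)
    """
    if schema_type is None:
        # Not schema processed: use the dedicated "untyped" names.
        if node_kind == "element":
            return "xdt:untyped"
        if node_kind == "attribute":
            return "xdt:untypedAtomic"
        raise ValueError(f"unsupported node kind: {node_kind}")
    # Schema processed: keep whatever XML Schema assigned, including
    # xs:anyType and xs:anySimpleType.
    return schema_type
```

With this rule, seeing xdt:untyped or xdt:untypedAtomic is enough to
tell that no schema processing took place for that node.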

2. Compatibility with the XML Schema type system.

If a document has been schema-validated, the types used in the
document should be compatible with those given to it by XML
Schema. This is listed explicitly as a goal in our charter.

  It is a goal of the XML Query work to be compatible with the work of
  the XML Schema Working group on XML Schema Part 2: Datatypes (XML
  Schema Part 2) and XML Schema Part 1: Structures (XML Schema Part
  1).  For example, it should be possible to base query predicates on
  the existing DTD or XML Schema Part 1 definition of the content of
  an XML document and on the new data types being defined as part of
  the XML Schema Part 2. In addition the XML Query work will take
  advantage of the formal description of the contents of XML Schema
  defined in XML Schema: Formal Description (XML Schema: Formal
  Description).

When schema processing is done, the Data Model should use the same
type names as XML Schema. We currently map all instances of xs:anyType
to xdt:untypedAny [1] and all instances of xs:anySimpleType to
xdt:untypedAtomic, which means that someone who understands XML Schema
must also understand how our types differ from those in the XML Schema
specification. If XML Schema assigns the type xs:anyType, the Data
Model should use the same type. If XML Schema assigns the type
xs:anySimpleType, this type should be preserved in the Data Model.
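
To make the difference concrete, here is a minimal sketch of the two
name mappings (Python for illustration only; the dictionary and
function names are hypothetical):

```python
# Status quo as described above: two XML Schema type names are renamed
# on the way into the Data Model.
STATUS_QUO_RENAMES = {
    "xs:anyType": "xdt:untypedAny",
    "xs:anySimpleType": "xdt:untypedAtomic",
}


def data_model_name_status_quo(schema_name: str) -> str:
    """Name the Data Model reports today for a schema-assigned type."""
    return STATUS_QUO_RENAMES.get(schema_name, schema_name)


def data_model_name_proposed(schema_name: str) -> str:
    """Name the Data Model would report if schema names were preserved."""
    return schema_name
```

Under the status quo the two functions disagree for exactly the two
names in the table; under the proposal the mapping is the identity.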

This is important not only for the comprehension of those poor souls
who must understand both XML Schema and XQuery's type system, but also
because XQuery and XSLT are not the only systems that use type
information from XML Schema. Since we use different type names,
software based on the PSVI has different type names for untyped data
than software based on the Data Model. A Java or C++ program using a
PSVI API will have the same type names as the Data Model for almost
every other named data type, but not for these two - which means that
someone using XQuery embedded in a Java program, or an XQuery that
makes external calls to Java, must be aware of the two sets of type
names and how they relate to each other. Also, a browser based on the
PSVI representation reports different type names than a browser based
on the Data Model, and debugging tools based on the two different
representations report different type names. This is especially
important since many of us see the Data Model as an important
simplification of the PSVI that may become the basis for many
specifications. It must not be at odds with XML Schema. Our charter
asks us to design a language, not to change the type hierarchy used by
XML Schema. If our language can't match a type in the type hierarchy
or express the type for the purposes of static inference, the solution
is to change the language, not the Data Model. Our status quo goes
against the charter in a way that hurts interoperability among
specifications and tools.

Jonathan

[1] This is called xdt:untyped in internal drafts that have not yet
been released.

Received on Tuesday, 10 February 2004 18:44:33 UTC