XML Query Comments to XML Schema (2nd part) from Paul Cotton on 2000-05-29 (www-xml-schema-comments@w3.org from April to June 2000)

From: Paul Cotton <paulcotton@alumni.uwaterloo.ca>
Date: Mon, 29 May 2000 12:28:34 -0400
To: www-xml-schema-comments@w3.org
Cc: w3c-xml-query-wg@w3.org
Message-Id: <200005291628.MAA20480@tux.w3.org>
Here is the second set of comments from the XML Query Working Group on the
XML Schema last call Working Draft.
    http://www.w3.org/TR/2000/WD-xmlschema-0-20000407/
    http://www.w3.org/TR/2000/WD-xmlschema-1-20000407/
    http://www.w3.org/TR/2000/WD-xmlschema-2-20000407/

In this version, we address the following issues:

  2. XML Query data model related issues
    2.1 Treatment of anonymous types
    2.2 Schema for schemaless documents
    2.3 Treatment of collections
    2.4 Problems with minOccurs and maxOccurs
    2.5 Identity-constraints tables
    2.6 Referential mechanisms across multiple documents
    2.7 Internal representation of datatypes
    2.8 Infoset contributions for simple types
  3. Algebra related issues
    3.1 Operations
    3.2 Treatment of NULLS

This list is not exhaustive and the XML Query WG will provide
additional feedback at a later date.

- Paul Cotton, on behalf of the XML Query WG
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

2. XML Query data model related issues
--------------------------------------

2.1 Treatment of anonymous types
--------------------------------

XML Query will require access to explicit schema information for every
element and attribute in order to know, e.g. what kind of operations
are legal on those nodes.

The current prescription in XML Schema Part 1: Structures, section
3.3, is that if the name of the actual type definition "is absent,
schema processors may, but need not, provide a value unique to the
{type definition} of the declaration."  Quite apart from this unique
value being rather mysterious, Query will require something that is
both mandatory and consistent with the treatment of named types.

For the query language, we may not need the "identity" of an anonymous
type - wouldn't it be sufficient to have the type definition itself?
For anonymous types, equality of type can reasonably be defined as
structural equivalence.  For heavy users of anonymous types, however,
attaching the definition to every element would lead to enormous
redundancy in the PSV-Infoset.  This suggests that the infoset
contributions should also include new Type Information Items (TIIs)
that could be referenced from the Element Information Items (EIIs).

This would be advantageous for named types too, to save users of the
PSV-Infoset from having to locate schemas (except for more general
schema investigation).  We would like to offer the following proposal
for consideration:

===============================================================

Schema Infoset Contribution: Element Validated by Type (Structures
3.3):

First, insert a Type Information Item for the actual type definition
into the set of TIIs (see below).  Since the TIIs form a set,
duplicates are not inserted.

[Note that this requires some work on detailed definition of equality
of anonymous types. Also namespaces must be added to named type
definitions to avoid false elimination of apparent duplicates.
However, for anonymous types one probably wants to ignore the
namespace if the types are structurally the same.]

Then add the following to the EII:

[type definition namespace]
[type definition name] - may be absent for anonymous types
[type definition reference] - reference to its TII


Schema Infoset Contribution: Type Information Item

This contribution is the set of TIIs that need to be referenced within
the PSV-Infoset (except for the built-in simple types - and other
types defined within the Schema spec?).


A TII has the structure of an EII in the Infoset for the schema that
defines the corresponding <simpleType> or <complexType> element.

=================================================================

So navigating a TII would be equivalent to going to the schema and
navigating the type definition.

Basically, a user of the PSV-Infoset would always have the content of
any type definition handy (or known already from the Schema spec if in
that namespace), and would also have the names of named types for
strong type checking where needed.

The TII would carry the simple|complex information, so [type
definition type] is not needed in the element SISC.

Also [type definition anonymous] can be omitted, since it is redundant
with absence or presence of a [type definition name].
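
The TII set in the proposal above could be sketched as follows in
Python.  This is only an illustration: it assumes that structural
equivalence of anonymous types can be reduced to comparing a kind and a
tuple of (child name, child type) pairs, which is exactly the detailed
definition the proposal leaves open.  The names TypeDef and TIISet are
hypothetical and appear in no Schema draft.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class TypeDef:
    """Hypothetical stand-in for a Type Information Item (TII)."""
    kind: str                               # "simple" or "complex"
    content: Tuple[Tuple[str, str], ...]    # assumed structural summary
    namespace: Optional[str] = None
    name: Optional[str] = None              # None for an anonymous type

    def key(self) -> tuple:
        if self.name is None:
            # Anonymous: compare by structure and ignore the namespace.
            return ("anon", self.kind, self.content)
        # Named: the namespace is part of the identity, so equal local
        # names in different namespaces are not treated as duplicates.
        return ("named", self.namespace, self.name)


class TIISet:
    """A set of TIIs; inserting a duplicate returns the existing item."""

    def __init__(self) -> None:
        self._items: Dict[tuple, TypeDef] = {}

    def insert(self, tii: TypeDef) -> TypeDef:
        # The returned item is what an EII's [type definition reference]
        # would point to.
        return self._items.setdefault(tii.key(), tii)

    def __len__(self) -> int:
        return len(self._items)
```

With this sketch, two anonymous types with the same structure share one
TII even if they come from different namespaces, while two named types
that differ only in namespace remain distinct.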

2.2 Schema for schemaless documents
-----------------------------------

We require a standard way to represent the "schema" of documents
which have DTDs or do not have any schema at all.  In particular, we
need to have a representation for the ur-type.

2.3 Treatment of collections
----------------------------

In processing a query, sometimes the order of children in an element
is relevant and sometimes it is not.  In the case where order is not relevant,
additional optimizations may be performed.  It would be helpful if
schema could provide some way to indicate whether the order of the
children is significant.  For instance, this might be done by giving
a type an `ordered' property.  Thus, just as the content of a
non-empty element is always either mixed or elementOnly, it also might
be either ordered or unordered.
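
As a rough illustration of what such a property could mean to a query
processor, the following Python sketch compares the children of two
elements with and without regard to order.  The function name and the
`ordered` flag are illustrative only; children are represented simply
as lists of names.

```python
from collections import Counter
from typing import Sequence


def same_children(a: Sequence[str], b: Sequence[str],
                  ordered: bool = True) -> bool:
    """Compare two child sequences under the type's `ordered` property."""
    if ordered:
        # Order is significant: the sequences must match position by position.
        return list(a) == list(b)
    # Order is not significant: compare as multisets, so a processor is
    # free to store or scan the children in any order.
    return Counter(a) == Counter(b)
```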

2.4 Problems with minOccurs and maxOccurs
-----------------------------------------

A. The default for maxOccurs behaves counter-intuitively. When
   maxOccurs is not explicitly specified, it inherits the value of
   minOccurs (which defaults to 1 if not specified). This is
   confusing. For example, po.xsd in XML Schema Part 0 (Primer)
   contains the declaration <xsd:element ref="comment" minOccurs="0"/>.
   Because maxOccurs then also takes the value 0, this effectively
   prohibits comments in the instance document.

   The XML Query Working Group suggests that Schema require that
   minOccurs and maxOccurs occur together or that Schema normatively
   adopt the default-rule mentioned in Appendix B of XML Schema
   Part-1: "maxOccurs defaults to 1 or minOccurs, whichever
   is greater".

B. The XML Query Working Group finds the different treatment of the
   properties minOccurs/maxOccurs, fixed, default, and value in the
   XML representation for element-declarations and for
   attribute-declarations confusing.  The XML Query Working Group
   suggests using the same representation for element-declarations
   and attribute-declarations, and constraining the allowed values of
   minOccurs and maxOccurs in attribute-declarations to "0" or "1".
   This would allow queries such as:

   "Select all attributes and elements that may occur at most once"

   to be evaluated more efficiently.

C. There is an inconsistency between '*' and 'unbounded'. The Primer
   uses "*" to mean infinity, and the Datatypes spec uses "*" in its
   Appendix B, while other places in the spec use "unbounded".
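
The two defaulting rules discussed in item A can be compared with a
short Python sketch.  The function names are illustrative and do not
come from the drafts; only integer values are handled, so "*" and
"unbounded" are out of scope here.

```python
from typing import Optional, Tuple


def occurs_inherit(min_occurs: Optional[int] = None,
                   max_occurs: Optional[int] = None) -> Tuple[int, int]:
    """Rule described in item A: an omitted maxOccurs inherits minOccurs."""
    lo = 1 if min_occurs is None else min_occurs
    hi = lo if max_occurs is None else max_occurs
    return lo, hi


def occurs_appendix_b(min_occurs: Optional[int] = None,
                      max_occurs: Optional[int] = None) -> Tuple[int, int]:
    """Appendix B rule: maxOccurs defaults to 1 or minOccurs, whichever is greater."""
    lo = 1 if min_occurs is None else min_occurs
    hi = max(1, lo) if max_occurs is None else max_occurs
    return lo, hi


# <xsd:element ref="comment" minOccurs="0"/>
print(occurs_inherit(min_occurs=0))     # (0, 0): the element can never appear
print(occurs_appendix_b(min_occurs=0))  # (0, 1): the element is optional
```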


2.5 Identity-constraints tables
-------------------------------

XML Schema Part 1: Structures section 3.10 discusses the Infoset
contributions for identity constraints.

In order to verify that identity constraints are satisfied, it defines
identity-constraint tables to be added to Element Information Items.
These tables in effect would let a query processor find the element
referred to by any keyref.

A. The note at the end of the section says, however, that these tables
   are optional.  Conformant schema processors are *not* required to
   expose them.  This means that a query processor working with a PSV
   Infoset created by a conformant processor that does not expose such
   tables may be forced to reconstruct some or all of them -- possibly
   an expensive process, and clearly unnecessary as the schema
   processor would have created them to check the identity constraints
   and then thrown them away!

   We suggest that all conformant XML Schema processors must be able
   to expose the identity-constraint tables, but need not do so if
   requested otherwise.

B. We would like to request a reformulation as a single
   "identity-constraint index" from which it would also be easy to
   find all the elements whose keyrefs referred to a key.

   A simpler representation would promote interoperability of
   conformant XML Schema processors.  We are thinking both of
   conceptual simplicity and of a corresponding API that could support
   transfer of this information in practice.
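
To make the request in item B more concrete, here is a minimal Python
sketch of a single index that answers both questions: which element a
keyref refers to, and which elements refer to a given key.  The class
and method names are hypothetical, elements are represented by plain
strings, and key values by tuples; none of this is taken from the
Structures draft.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Tuple

Entry = Tuple[str, Hashable]   # (constraint name, key value)


class IdentityConstraintIndex:
    """One index for both directions of key/keyref lookup."""

    def __init__(self) -> None:
        self._key_to_element: Dict[Entry, str] = {}
        self._key_to_referrers: Dict[Entry, List[str]] = defaultdict(list)

    def add_key(self, constraint: str, value: Hashable, element: str) -> None:
        entry = (constraint, value)
        if entry in self._key_to_element:
            raise ValueError(f"duplicate key {value!r} for {constraint!r}")
        self._key_to_element[entry] = element

    def add_keyref(self, constraint: str, value: Hashable, element: str) -> None:
        # `constraint` names the key that the keyref refers to.
        self._key_to_referrers[(constraint, value)].append(element)

    def target(self, constraint: str, value: Hashable) -> Optional[str]:
        """The element identified by a key value, or None if there is none."""
        return self._key_to_element.get((constraint, value))

    def referrers(self, constraint: str, value: Hashable) -> List[str]:
        """All elements whose keyrefs refer to the given key value."""
        return list(self._key_to_referrers.get((constraint, value), []))
```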

2.6 Referential mechanisms across multiple documents
-----------------------------------------------------

Query has a requirement to query across collections of documents,
which implies that we will need referential mechanisms other than URI
references (e.g., keys/keyRefs) across multiple documents.  In version
1, the reference mechanisms defined by Schema are restricted to a
single document.  Mechanisms such as XPointer might address
inter-document references if extended to support the keyRef datatype.
We believe there is a future requirement for referential mechanisms
between documents.

2.7 Internal representation of datatypes
-----------------------------------------

Schema defines datatypes in the PSV Infoset for Query to access. The
PSV Infoset is extracted from the XML document by a PSV-enabled parser.  The
Query WG is interested in working together with the Schema WG and
other working groups, e.g., DOM, to determine whether the physical
representation of each schema primitive datatype (e.g., floating point
numbers) should be an optional PSV characteristic.  This would
increase interoperability by moving the conversion of datatypes into
the realm of a PSV Schema processor.

Some members of the Query WG believe that this comment encroaches on
implementation details, but would like to further discuss this issue
with the Schema WG.

2.8 Infoset contributions for simple types
------------------------------------------

There are differences in the infoset for simple types (datatypes)
between part 1 and part 2 of the schema spec:

A. The part 1 spec has an [abstract] property.  The part 2 spec does
   not.

B. The part 1 spec does not have the property [fundamental facets].
   Except for "bounds", the other fundamental facets (equal, order,
   cardinality, numeric) are constant for a base datatype and its
   derived types.  There is no need to represent this constant
   information in the PSV Infoset.

C. The structures spec has two properties, [base type definition] and
   [primitive type definition].  The datatypes spec has a single
   property [base type definition].  The primitive type can be
   obtained by following the base type chain, but storing the
   primitive type is more efficient for certain kinds of type
   inference.
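
The trade-off in item C can be illustrated with a small Python sketch.
It assumes a simple type is modelled as a name plus an optional base
type; the class and function names, and the type names in the usage
note, are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimpleType:
    """Hypothetical model of a simple type with a [base type definition]."""
    name: str
    base: Optional["SimpleType"] = None   # None marks a primitive type


def primitive_of(t: SimpleType) -> SimpleType:
    """Follow the base type chain until a type with no base is reached."""
    while t.base is not None:
        t = t.base
    return t
```

For a chain such as quantity -> integer -> decimal, primitive_of takes
one step per derivation level.  A stored [primitive type definition]
would give the same answer in a single lookup, which is the efficiency
gain referred to above.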

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

3. Algebra related issues
-------------------------

3.1 Operations
--------------

There is a need for operations to be defined on base types. Schema
doesn't define any built-in operations or provide any mechanism for
user-defined operations on types. As a result, the Query WG needs to
define these. The Query WG will also need to determine the type of the
arguments to select the right operator (e.g., floating point
vs. integer arithmetic) and do the appropriate type coercion. The type
coercion rules need to be defined.  The Query WG intends to
define these operations and looks forward to doing this in cooperation
with the Schema WG.

3.2 Treatment of NULLS
----------------------

The Query WG has not reached a consensus regarding the definition of
NULLs.  We expect that the Query WG will submit comments regarding
nulls in the future, once we have determined their potential impact on
the Query algebra. In the interim, we have asked individual members of
the Query WG to send their comments regarding NULLs directly to the
Schema WG. 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Paul Cotton, Microsoft Canada mailto:paulcotton@alumni.uwaterloo.ca
17 Eleanor Drive, Nepean, Ontario K2E 6A3
Tel: (613) 225-5445 Fax: (613) 226-6913
Received on Monday, 29 May 2000 12:28:28 UTC