Review of XML Schema: Formal Description from Peter Fankhauser on 2001-05-23 (www-xml-schema-comments@w3.org from April to June 2001)

From: Peter Fankhauser <fankp@darmstadt.gmd.de>
Date: Wed, 23 May 2001 11:10:13 +0200
To: <www-xml-schema-comments@w3.org>
Cc: <w3c-xml-query-wg@w3.org>
Message-ID: <KNEAKBHGPLOADKCAOIFNEEAICEAA.fankp@darmstadt.gmd.de>
Hi,

find enclosed the XML Query Working Group review of
XML Schema: Formal Description
W3C Working Draft, 20 March 2001
http://www.w3.org/TR/2001/WD-xmlschema-formal-20010320/

Best regards,

Peter Fankhauser
----

General Comments:

Formalizing XML-Schema Part 1 (structures) is a daunting endeavour. The
editors have done a great job in formalizing some of the key issues
arising in the deployment of XML-Schema Part 1:

(a) normalized universal names for schema components
(b) a reasonably concise syntax for XML Schema to facilitate formalization
(c) a formal characterization of document validation

I've reviewed the document from the perspective of an editor of
the XML Query Formalization WD (in short XQF), checking to what extent
the XML Schema Formalization WD (in short XSF) provides a path to fully
align the XQF's current type system with that of XML Schema.

Some of the most pressing open issues from this perspective are:

(1) Alignment of type system:

how can the XDuce inspired type system of XQF be aligned with
XML Schema? In particular:

(a) how can we repair the undifferentiated notion of a "type variable" in
XQF?

That is: in XQF "type variables" are used for three different purposes:

(a.1) element declarations:

type Bib = bib[book[]]

(a.2) complex type definitions

(which may recurse)

type Part = Complex | Simple

(a.3) model groups

type Book = book[]{1,*}
(which may not recurse)

(b) how can we reflect the component-model of XML Schema?

Section 3 in XSF goes a long way in the right direction.
Moreover, Section 8 in the latest version of XSF (May 1,
http://www.w3.org/XML/Group/xmlschema-current/formalization/formaldesc.html)
describes a mapping from XML Schema components to XSF-sorts.

Thus it appears that by adopting XSF sorts (maybe modulo a few (coordinated)
syntactic adaptations) the XQF query type system can be
well aligned with the XML Schema components.

(2) Type subsumption:

what is a suitable, comprehensive characterization of type subsumption
which takes into account:
(a) type names
(b) the type derivation hierarchy of XML Schema
(c) element substitution groups
(d) structural subsumption of model-groups and of elements with an
anonymous type as content.

The XML Query WG needs this to formalize subtype substitutability in
functions and the static semantics of explicit type declarations.
Currently, the XQF notion of subtype only takes into account (d),
and ignores (a),(b), and (c). Conversely, the XML Schema notion of
"subtype" only takes into account (a) through (c).

More formally, XQF's notion of subtype amounts to:

type1 subtypeOf type2
iff every instance valid for type1 is also valid for type2.

While XSF does formally characterize validation which is aware of (a)
through (c) in Section 6 (Document Validation), it does not give
a constructive method for deciding about type1 subtypeOf type2.
XQF needs such a constructive method and the two groups should
coordinate its design.

(3) Referring to types in the XML Query (XML Path 2.0) Datamodel

The NUN's (normalized universal names) of XSF are great (at least
in their unabbreviated form), and the discussion in Section 2.3
shows rather clearly that ref(type) in the Datamodel can indeed be
realized by NUNs.

So much for the (rather) good news. The bad news is that the document
is not really easy reading, although for the most part the editors have
done a great job in structuring the document well, and in motivating and
explaining the methods and concepts. Nevertheless, I'm afraid
that few readers will really fight their way through, which is
unfortunate, as they would really miss something.

Partially, this difficulty is certainly due to the inherent complexity
of the problem. However, in some parts the presentation could be
improved:

(a) more mnemonic names for non-terminals rather than one- or two-
letter categories.
(b) less creativity in inventing new syntax everywhere (e.g. for
documents)
(c) more disciplined use of special characters (the document
can be rendered in Amaya, but (e.g.) IE 5.x and the Acrobat
Distiller rendering for HTML fail miserably).

I hope that some of the more detailed comments below can help to improve
readability here and there. Most of the comments are editorial, some
of them are about the content. I've tried to prioritize them
into minor/medium/major.

Detailed Comments:

Sec 1/Par 5 (editorial/minor)
-----------------------------

I wondered a while about "context free grammar", until it dawned on me
that this means "context free grammars as opposed to the XML-syntax of
XML-Schema", rather than "XML-Schema as a context free grammar".


Sec 2.1 (content/medium)
------------------------

I like the unabbreviated syntax for NUNs, inspired by the axes in XPath 1.0. I
don't like the abbreviated syntax, because it overloads .../foo/Foo,
alternating between foo as element-name and Foo as type-name. In addition,
the different meanings of * (wildcard in XPath, anonymous type in NUNs)
are a bit confusing.

Here are a few alternatives:
The abbreviated syntax could either not abbreviate the type-axis

type::u/d/type::*/a

or use a separate abbreviation for type-names, e.g.,

%u/d/%*/a

N.b.: for a while I thought type::* for anonymous types should be
avoided altogether, but this didn't survive closer inspection.
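
To make the second alternative concrete, here is a minimal sketch in Python
of how the unabbreviated form could be rewritten into the "%" form. The
function name abbreviate_nun is hypothetical, and the sketch assumes that
steps are separated by "/" and that the type axis is always spelled "type::";
it ignores the namespace part of a NUN.

```python
def abbreviate_nun(nun: str) -> str:
    """Rewrite every 'type::' step of an unabbreviated NUN with a '%' prefix.

    Assumption: steps are separated by '/', and only the type axis is
    abbreviated; all other steps are copied unchanged.
    """
    axis = "type::"
    steps = nun.split("/")
    return "/".join(
        "%" + step[len(axis):] if step.startswith(axis) else step
        for step in steps
    )


print(abbreviate_nun("type::u/d/type::*/a"))  # %u/d/%*/a
```

Element steps such as d and a stay untouched, so the element-name/type-name
overloading of the current abbreviated syntax does not arise.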


Sec 2.2.1 (editorial/minor)
---------------------------

The abstract syntax a[g] for attribute with name "a" is a bit
misleading (without having gone into the details about sorts in
Section 3.5). You may consider deferring a detailed exposition of
component content to 3.4. The example in 2.2.2 is helpful and
should stay; although an example with some "meaningful" names would
work even better.


Sec 2.3 (editorial/medium)
--------------------------
Some parentheses for the normalized elements with type information
added would improve readability:

a[
  u types (
    t/@b ...
   ...]
  )
 ]

Maybe one can do without a special syntax for documents (and forests(?))
entirely:

<a xsi:type="u",
   %t/@b=(xsi:string)"zero",
   %t/@c=(%s)"1 2">
   <%u/d xsi:type = "%u/d/%*">
     <%u/d/%*/a xsi:type="xsi:string">three</a>
     <%u/d/%*/a xsi:type="xsi:string">four</a>
   </d>
</a>

This extends the XML 1.0 syntax as follows:
(1) use NUNs in start-tags and attribute names
(2) annotate attributes with their type;
by a leading "(type)" (or "{type}", or ...)

This also illustrates the effect of schema validation on an XML-document
to XML 1.0 aficionados. They may not like it, but at least they may
understand it then, and continue to "watch the bits on the wire".


Sec 3.1 (editorial/minor)
-------------------------

The introduction of special "name classes" (a,e,t) for three symbol
spaces, (not to mention s,k,x..) accomplishes brevity, but impedes
readability. One might consider using more mnemonic abbreviations for
non-terminals, and avoid "name classes" by using unabbreviated syntax.


Sec 3.4 (content/medium)
------------------------

I wonder whether we don't also need a production:
g ::= (g)


Sec 3.6 (content/medium)
------------------------

I wonder whether "element groups" are allowed to contain "type names",
and thereby also choice, sequence, etc. of "type names".
Shouldn't this be model group names?


Sec 3.8 (editorial/minor)
-------------------------

The use of "in" for expressing "instance d has type g" conflicts with the
use of "in" in Section 4.


Sec 4, General (editorial/minor):
---------------------------------

The special character for "=>" is not rendered on IE5.x. (I substituted
it with "normalizesTo" in my local copy).

I couldn't find a usage of the notation "x notin deref()". I also don't
understand why this isn't "x notin dom(deref())".


Sec 4, Rule for "Extend Attribute Transitive" (editorial/medium):
-----------------------------------------------------------------
x<:y is not yet defined. Please refer to Section 5 in the explanation.


Sec 4, Rules for "Extend Attribute Base" and "Extend Element Base":
-------------------------------------------------------------------
(content/medium)

I don't quite understand the mechanics of these rules.
Are "e" and "a" already NUNs?


Sec 4, Rule for "Constant:" (editorial/minor):
----------------------------------------------

Where does the prime in "c'" come from?


Sec 4, Rule for "Untyped Element:" (editorial/medium):
-----------------------------------------------------

This rule is a toughie. Some explanatory prose would help here.


Sec 5.1 (editorial/minor):
--------------------------

"x <: :x2" should say "x <: x2"?


Sec 5.3 (content/major):
------------------------

The model-theoretic definition of restriction needs to be elaborated by a
constructive/algorithmic definition. Here's a start (not taking into
account interleaving, modelgroup names, attribute group names, groups in
parentheses, mixed content):


Empty Sequence:

-----------------
eps <:_res g{0,n}


Empty Choice:

----------
0 <:_res g


Sequence 1:

g1 <:_res g1'   g2 <:_res g2'
-----------------------------
g1,g2 <:_res g1',g2'


Sequence 2:

g1,g2 <:_res g1' or g1,g2 <:_res g2'
------------------------------------
g1,g2 <:_res g1' | g2'


Sequence 3:

g1 <:_res g{m1,n1}  g2 <:_res g{m2,n2}
m1+m2 >= m, n1+n2 <= n
--------------------------------------
g1,g2 <:_res g{m,n}


Choice:

g1 <:_res g
g2 <:_res g
----------------
g1 | g2 <:_res g


Repetition 1:

g{m1,n1} <:_res g1
g{m2,n2} <:_res g2
m1+m2 <= m, n1+n2 >= n
---------------------
g{m,n} <:_res g1,g2


Repetition 2:

g{m,n} <:_res g1 or g{m,n} <:_res g2
------------------------------------
g{m,n} <:_res g1 | g2


Repetition 3:

g <:_res g'  m1>=m2  n1<=n2
---------------------------
g{m1,n1} <:_res g'{m2,n2}


Attribute:

g <:_res g'
-----------------
a[g] <:_res a[g']


Element:

g <:_res g'
-----------------
e[g] <:_res e[g']


N.b. 1: This does not take into account substitution groups and wildcards.

N.b. 2: This also does not take into account cases where the content-type
g is derived by extension from content type g'. That would be as follows
(but I'm not sure whether we want that):

e <: e'  g <: g'
------------------
e[g] <:_res e'[g']


Sec 5.4 (content/medium):
---------------------------

I think the rule should say "der in deref(x').derivation" instead of "der
= deref(x').derivation", and x' needs to be "wellformed" as well.

|- x'
x' = deref(x).base
der in deref(x').derivation
deref(x).content <:_der deref(x').content
-----------------------------------------
|- x
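
Read operationally, the rule above walks up the base chain. The following
Python sketch is only one possible reading under stated assumptions: the
schema is modelled as a plain dict (deref) from component names to records
with the keys "base", "derivation" and "content"; a component without a base
is taken to be wellformed (a base case the rule itself does not state);
cyclic base chains are rejected; and the check for <:_der is passed in as a
caller-supplied function, here called derives. All of these names are
hypothetical.

```python
def well_formed(x, deref, derives, seen=frozenset()):
    """Check |- x by recursion on the base chain of x."""
    comp = deref[x]
    base_name = comp.get("base")
    if base_name is None:      # assumed base case: a root without a base
        return True
    if x in seen:              # assumed guard against cyclic base chains
        return False
    base = deref[base_name]
    return well_formed(base_name, deref, derives, seen | {x}) and any(
        # some derivation method permitted by the base must relate the contents
        derives(der, comp["content"], base["content"])
        for der in base["derivation"]
    )


# Toy stand-in for <:_der: contents are sets of names, restriction is
# "subset of" and extension is "superset of" (illustration only).
def toy_derives(der, content, base_content):
    if der == "restriction":
        return content <= base_content
    if der == "extension":
        return content >= base_content
    return False


schema = {
    "Root": {"base": None, "derivation": {"extension", "restriction"},
             "content": {"a", "b"}},
    "Sub": {"base": "Root", "derivation": {"restriction"}, "content": {"a"}},
    "Bad": {"base": "Sub", "derivation": {"restriction"},
            "content": {"a", "c"}},
}
print(well_formed("Sub", schema, toy_derives))  # True
print(well_formed("Bad", schema, toy_derives))  # False
```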


Sec 6.1, Par 1 (content/major):
-------------------------------

Are the documents to be validated already in normalized form or not?
According to the last par in Sec 6.1, they can be either. What is the
processing model behind this? When validating a document, does one first
normalize, then validate, or vice versa?


Sec 6.1, Rule "Typed Attribute" (editorial/minor):
--------------------------------------------------

The rule should be:

d in s
--------------------
a[s types d] in a[s]


Sec 6.2
-------

I have not reviewed this Section...


Sec 7, Par 3 (editorial/medium):
--------------------------------

I don't understand:
"d --> g (eg writes as g)"
"f" is probably the description fragment?


Sec 7.1 (editorial/medium):
---------------------------

What is "x" in all rules? Where does it come from, what does
it contain?

Generally, the mapping rules came out so badly in my printed version that
I did not review them in detail (but I think I got the general idea...)
Received on Wednesday, 23 May 2001 05:03:15 UTC