- From: Matthew Fuchs <matthew.fuchs@commerceone.com>
- Date: Thu, 18 May 2000 18:58:55 -0700
- To: "'www-xml-schema-comments@w3.org'" <www-xml-schema-comments@w3.org>
I think this message from Bell Labs deserves more than a cursory reply.

The current document is absurdly long and extremely complex. While just a data definition language, it is significantly longer than the Haskell standard (which Phil Wadler chaired) and close to the size of, if not longer than, the ML spec. Both of these are full programming languages whose specifications contain full formal semantics for the languages. ML, in particular, is fairly large and its data definition component is not significantly less complex than xsdl ought to be. Even worse, we've probably breezed past 8879 - the SGML spec that XML has practically replaced due to its _greater simplicity_, which we are about to remit to the dustbin of history. Generally speaking, unnecessary complexity is a sign of bad design.

The leakage of schema information (xsi:type and xsi:null) into instances is an ominous sign - it implies the design is less than crisp. xsi:null worries me because, while it was included to satisfy the needs of the database community, by putting it in the base language, they've lost control over it - there's nothing to enforce use that in any way accords with SQL semantics, and there may be a lot of nasty surprises coming down the pike. If the DBMS vendors had instead used xsdl to create their own standard mechanism, they could have strictly controlled semantics.

The recommendation for simpleTypes is very interesting if it can eliminate the content attribute. It also has the significant advantage over the current spec that mixed content is no longer required to be a string. It would be great if we could eliminate the equally redundant derivedBy attribute and unify refinement.

The xsi:type issue is strongly connected with the 3rd section on context-independent vs. context-dependent types.
While it is true that the xsi:type issue was discussed at length (and - truth in advertising - I was probably the only one arguing along the same lines as the authors), it was done under a set of assumptions which, while not wrong - wrong is not the right term for this - led to the current rococo result. The issue of context-independent vs. dependent is very important, but decisions have not been made from that perspective. Of course, having a unique URI for everything referenceable from an instance (i.e., elements and types) ensures context-independence and simplifies many things, but the group has not chosen to support that.

While I don't think anything should stand in the way of going to CR asap, I hope people will get a better understanding of this and we will be able to modify things accordingly. It's already pretty clear that the number 1 complaint we will hear is "why is it so damn complicated". The unstated part is "when it doesn't need to be".

Matthew

Message-Id: <200005122308.TAA15405584@nslocum.cs.bell-labs.com>
To: www-xml-schema-comments@w3.org
Cc: Mary Fernandez <mff@research.att.com>, simeon@research.bell-labs.com, wadler@research.bell-labs.com
Date: Fri, 12 May 2000 19:08:51 -0400
From: Philip Wadler <wadler@research.bell-labs.com>
Subject: Simplifying XML Schema

Simplifying XML Schema
----------------------

The current Schema proposal is complex. Programmers have shown a remarkable ability to put up with complexity, but we do not yet know whether the XML community will be so forgiving.

We would like to suggest that it is possible to greatly simplify XML Schema, while not unduly limiting its power. Indeed, some of the suggestions below would both simplify Schema and extend its power at the same time.

We are also asking the XML Query working group to support changes along these lines, but in writing this letter we are not acting as representatives of XML Query.

Yours sincerely,

Philip Wadler, Lucent
Jerome Simeon, Lucent
Mary Fernandez, AT&T

1. Clear separation between schema and data
-------------------------------------------

One of the nice features of XML is that documents are "self-describing". Schema has two features which run counter to this philosophy, xsi:type and xsi:null. Our motto here is, `Keep schema out of the data!'

1.1. xsi:type
-------------

Schema permits refinement in two forms: an element may be declared as being a subclass of another element, and a type may be declared as a subtype of another type. This is explained in Section 4 of the primer. (When one element is a subclass of another element, Schema says the first element is `in the equivalence class' of the second. We use `subclass' because it has the right connotations, whereas `equivalence class' does not.)

When subtyping is used without subclassing, the document is required to include type information. Here's an example from Section 4 of the primer.

<shipTo export-code="1" xsi:type="ipo:UK-Address">
  <name>Helen Zoe</name>
  <street>47 Eden Street</street>
  <city>Cambridge</city>
  <postcode>CB1 1JR</postcode>
</shipTo>
<billTo xsi:type="ipo:US-Address">
  <name>Robert Smith</name>
  <street>8 Oak Avenue</street>
  <city>Old Town</city>
  <state>PA</state>
  <zip>95819</zip>
</billTo>

If subclassing is combined with subtyping, the use of xsi:type can be avoided.

<shipTo export-code="1">
  <UK-Address>
    <name>Helen Zoe</name>
    <street>47 Eden Street</street>
    <city>Cambridge</city>
    <postcode>CB1 1JR</postcode>
  </UK-Address>
</shipTo>
<billTo>
  <US-Address>
    <name>Robert Smith</name>
    <street>8 Oak Avenue</street>
    <city>Old Town</city>
    <state>PA</state>
    <zip>95819</zip>
  </US-Address>
</billTo>

This latter form is easily read by anyone who understands XML, even if they do not understand XML Schema.

We feel that the extra complexity of xsi:type outweighs any of its advantages. We suggest that subtyping be tied to subclassing, and that xsi:type be removed.

1.2. xsi:null
--------------

Reading between the lines, it seems clear that xsi:null is included in Schema to support some ways of using relational databases. That is, Schema is trying to help Query. But it is not at all clear what the Query group will decide about nulls.

We believe that xsi:null should be removed from Schema. Query should first decide on what mechanism is required for nulls, and then discuss the situation with Schema if Schema support is required.

2. Simple types vs. complex types
---------------------------------

One lack of orthogonality in XML Schema Part 1: Structures is that simple types and complex types cannot always be used in the same way. We suggest that simple types be permitted wherever complex types are. This would result in a number of simplifications:

* The 'content' attribute (which specifies 'mixed', 'element-only', or 'empty') may be eliminated.
* Rather than `mixed', which allows pcdata to appear anywhere, one can specify exactly where pcdata is allowed.
* One can specify that the presence of simple types is optional.
* This corresponds more directly to SGML and XML DTDs, which indicate mixed content by explicitly mentioning PCDATA.

For example, we can now specify a LETTER element that consists of a SALUTATION element, followed by some text, followed by a CLOSING element.

<xsd:element name='LETTER'>
  <xsd:element name='SALUTATION' type='xsd:string'/>
  <xsd:simpleType type="string"/>
  <xsd:element name='CLOSING' type='xsd:string'/>
</xsd:element>

This is more precise than using `mixed', and, because it lists the components in the order they appear, it is easier to read.

Of course, types must be parseable and serializable. Usually, values of primitive type can be space separated, the exception being strings (which may themselves contain spaces). Therefore, it is not allowed to specify two successive occurrences of primitive type if one or both of them is a string.

3. Context-independent types vs context-dependent types
-------------------------------------------------------

One of the great structuring principles of DTDs is that elements with the same name always have content of the same type. Many users of SGML take this as the foundation stone for structuring a document.

Schema departs from this: elements with the same name may have contents of differing type, depending on the context where they appear. However, Schema goes only halfway toward this, as there are some complex restrictions (apparently intended to ease parsing).

We suggest that the design should be a good horse or a good elephant, not a hybrid beast. Either choose a completely context-independent design, similar to DTDs, or choose a completely context-dependent approach, similar to that pursued by, for instance, the work on Xduce at the University of Pennsylvania.
In mathematicians' terms, we should either deal with trees that represent context-free grammars (which can be parsed by top-down deterministic tree automata), or with regular trees (which can be parsed by either bottom-up deterministic, bottom-up non-deterministic, or top-down non-deterministic automata; the three are equivalent).

3.1 Context-independent types
------------------------------

To make types context-independent, all that is needed is to change Schema to only allow global element declarations.

Advantages:

* Context-independent structuring is simple, and has a long history of use in the SGML community. Many users of SGML are gobsmacked over Schema's introduction of so much extra complexity to support features that seem to them to be positively counterproductive.
* In Schema, subclasses can be defined only for global elements (as explained in Section 1.3.3.2.3 of Schema Structures). In the Software Engineering community, there is a name for features that look attractive (like local element declarations) but inhibit the use of more powerful structuring techniques (like subclassing). They are called "bad". The context-independent approach restores compatibility with subclassing.
* Context-independent schemas are easy to parse, using top-down deterministic tree automata. Further, parsing can be incremental -- it is not necessary to read the entire tree into memory, and the space required is proportional to the depth of the tree, not its size.
* All the current complexity of associating types with elements in the infoset becomes unnecessary. All one needs is a table mapping element names to types. The simpler system will make it easier for other processors to exploit types.

3.2 Context-dependent types
----------------------------

To make types fully context-dependent, Schema should (at least) remove the restriction that all sibling elements with the same name should have the same type.

Advantages:

* The union of any two schemas is a schema, which facilitates the manipulation of multiple documents. With context-independent types, one must use namespaces.
* Context-dependent types are more expressive. In particular, they can be helpful for importing other data representations into XML. For instance, one might have two relational tables, A and B, one where ID is an integer, and one where ID is a string. With context-dependent types, these are easily described in a single schema, since an ID element inside an A element may have a different type than an ID element inside a B element. With context-independent types, one must either use renaming, or put the two tables in different namespaces.
* The type system may give much more precise information for queries or other applications.

Disadvantages:

* Parsing is more complex. Parsing may be achieved by either bottom-up deterministic, bottom-up non-deterministic, or top-down non-deterministic automata; the three are equivalent. Incremental parsing is not always possible, and in the worst case the space required is proportional to the size of the whole tree.
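
The A/B table example in 3.2 and the "table mapping element names to types" of 3.1 can be contrasted in a few lines. What follows is only a rough sketch and not part of the letter: the type names, the wrapping `tables` element, and the helper functions are all hypothetical, and keying on the parent's name is just the simplest kind of context dependence. The fully general case described in 3.2 would need a tree automaton, which is not shown.

```python
import xml.parsers.expat

# Hypothetical type tables for the A/B example; all names are illustrative.
# Context-independent: one entry per element name, so ID gets a single type.
TYPES_BY_NAME = {"tables": "container", "A": "record", "B": "record", "ID": "integer"}

# Context-dependent (simplest form): the key also includes the parent's name.
TYPES_BY_CONTEXT = {
    (None, "tables"): "container",
    ("tables", "A"): "record",
    ("tables", "B"): "record",
    ("A", "ID"): "integer",
    ("B", "ID"): "string",
}


def by_name(parent, name):
    """Look up a type from the element name alone."""
    return TYPES_BY_NAME[name]


def by_context(parent, name):
    """Look up a type from the parent name and the element name."""
    return TYPES_BY_CONTEXT[(parent, name)]


def assign_types(xml_text, lookup):
    """Return (path, type) for every element, in document order."""
    open_elements = []  # stack of open element names; never deeper than the tree
    assigned = []

    def start(name, attrs):
        parent = open_elements[-1] if open_elements else None
        open_elements.append(name)
        assigned.append(("/".join(open_elements), lookup(parent, name)))

    def end(name):
        open_elements.pop()

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return assigned


DOC = "<tables><A><ID>7</ID></A><B><ID>x7</ID></B></tables>"
```

Calling `assign_types(DOC, by_name)` labels both ID elements as integer, since the name alone decides the type, whereas `assign_types(DOC, by_context)` labels the ID under A as integer and the ID under B as string. In both cases the only state carried while reading is the stack of open element names.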
Received on Thursday, 18 May 2000 21:59:17 UTC