- From: Philip Wadler <wadler@research.bell-labs.com>
- Date: Fri, 12 May 2000 19:08:51 -0400
- To: www-xml-schema-comments@w3.org
- Cc: Mary Fernandez <mff@research.att.com>, simeon@research.bell-labs.com, wadler@research.bell-labs.com
Simplifying XML Schema
----------------------

The current Schema proposal is complex.  Programmers have shown a
remarkable ability to put up with complexity, but we do not yet know
whether the XML community will be so forgiving.

We would like to suggest that it is possible to greatly simplify XML
Schema, while not unduly limiting its power.  Indeed, some of the
suggestions below would both simplify Schema and extend its power at
the same time.

We are also asking the XML Query working group to support changes
along these lines, but in writing this letter we are not acting as
representatives of XML Query.

Yours sincerely,

  Philip Wadler, Lucent
  Jerome Simeon, Lucent
  Mary Fernandez, AT&T


1. Clear separation between schema and data
-------------------------------------------

One of the nice features of XML is that documents are
"self-describing".  Schema has two features which run counter to this
philosophy: xsi:type and xsi:null.  Our motto here is, `Keep schema
out of the data!'


1.1. xsi:type
-------------

Schema permits refinement in two forms: an element may be declared as
being a subclass of another element, and a type may be declared as a
subtype of another type.  This is explained in Section 4 of the
primer.

(When one element is a subclass of another element, Schema says the
first element is `in the equivalence class' of the second.  We use
`subclass' because it has the right connotations, whereas
`equivalence class' does not.)

When subtyping is used without subclassing, the document is required
to include type information.  Here's an example from Section 4 of the
primer.

  <shipTo export-code="1" xsi:type="ipo:UK-Address">
    <name>Helen Zoe</name>
    <street>47 Eden Street</street>
    <city>Cambridge</city>
    <postcode>CB1 1JR</postcode>
  </shipTo>
  <billTo xsi:type="ipo:US-Address">
    <name>Robert Smith</name>
    <street>8 Oak Avenue</street>
    <city>Old Town</city>
    <state>PA</state>
    <zip>95819</zip>
  </billTo>

If subclassing is combined with subtyping, the use of xsi:type can be
avoided.
  <shipTo export-code="1">
    <UK-Address>
      <name>Helen Zoe</name>
      <street>47 Eden Street</street>
      <city>Cambridge</city>
      <postcode>CB1 1JR</postcode>
    </UK-Address>
  </shipTo>
  <billTo>
    <US-Address>
      <name>Robert Smith</name>
      <street>8 Oak Avenue</street>
      <city>Old Town</city>
      <state>PA</state>
      <zip>95819</zip>
    </US-Address>
  </billTo>

This latter form is easily read by anyone who understands XML, even
if they do not understand XML Schema.

We feel that the extra complexity of xsi:type outweighs any of its
advantages.  We suggest that subtyping be tied to subclassing, and
that xsi:type be removed.


1.2. xsi:null
-------------

Reading between the lines, it seems clear that xsi:null is included
in Schema to support some ways of using relational databases.  That
is, Schema is trying to help Query.  But it is not at all clear what
the Query group will decide about nulls.

We believe that xsi:null should be removed from Schema.  Query should
first decide on what mechanism is required for nulls, and then
discuss the situation with Schema if Schema support is required.


2. Simple types vs. complex types
---------------------------------

One lack of orthogonality in XML Schema Part 1: Structures is that
simple types and complex types cannot always be used in the same way.

We suggest that simple types be permitted wherever complex types are.
This would result in a number of simplifications:

* The 'content' attribute (which specifies 'mixed', 'element-only',
  or 'empty') may be eliminated.

* Rather than `mixed', which allows pcdata to appear anywhere, one
  can specify exactly where pcdata is allowed.

* One can specify that the presence of simple types is optional.

* This corresponds more directly to SGML and XML DTDs, which indicate
  mixed content by explicitly mentioning PCDATA.

For example, we can now specify a LETTER element that consists of a
SALUTATION element, followed by some text, followed by a CLOSING
element.
  <xsd:element name='LETTER'>
    <xsd:element name='SALUTATION' type='xsd:string'/>
    <xsd:simpleType type="string"/>
    <xsd:element name='CLOSING' type='xsd:string'/>
  </xsd:element>

This is more precise than using `mixed', and, because it lists the
components in the order they appear, it is easier to read.

Of course, types must be parseable and serializable.  Usually, values
of primitive type can be space separated, the exception being strings
(which may themselves contain spaces).  Therefore, a type may not
specify two successive occurrences of primitive type if one or both
of them is a string, since the boundary between the two values could
not be recovered.


3. Context-independent types vs context-dependent types
-------------------------------------------------------

One of the great structuring principles of DTDs is that elements with
the same name always have content of the same type.  Many users of
SGML take this as the foundation stone for structuring a document.

Schema departs from this: elements with the same name may have
content of differing types, depending on the context where they
appear.  However, Schema goes only halfway toward this, as there are
some complex restrictions (apparently intended to ease parsing).

We suggest that the design should be a good horse or a good elephant,
not a hybrid beast.  Either choose a completely context-independent
design, similar to DTDs, or choose a completely context-dependent
approach, similar to that pursued by, for instance, the work on XDuce
at the University of Pennsylvania.

In mathematicians' terms, we should either deal with trees that
represent context-free grammars (which can be parsed by top-down
deterministic tree automata), or with regular trees (which can be
parsed by either bottom-up deterministic, bottom-up
non-deterministic, or top-down non-deterministic automata; the three
are equivalent).
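
The DTD principle above (one name, one content type) can be checked
mechanically.  The following is a minimal Python sketch, not part of
the Schema proposal: it assumes a schema has already been reduced to
a list of (parent, name, type) declarations, and the element names
BOOK, QUOTE, TITLE, and PRICE are hypothetical.

```python
# Assumption: a schema is given as (parent, name, type) declarations.
# This is an illustrative encoding, not XML Schema's own data model.

def name_to_type_table(declarations):
    """Build a name -> type table; fail if a name has two different types."""
    table = {}
    for parent, name, type_name in declarations:
        known = table.setdefault(name, type_name)
        if known != type_name:
            raise ValueError(
                f"element {name!r} is {known!r} in one context "
                f"and {type_name!r} under {parent!r}"
            )
    return table

# Context-independent: PRICE has the same type wherever it appears.
dtd_like = [
    ("BOOK", "TITLE", "string"),
    ("BOOK", "PRICE", "decimal"),
    ("QUOTE", "PRICE", "decimal"),
]

# Context-dependent: PRICE changes type with its parent.
context_dependent = [
    ("BOOK", "PRICE", "decimal"),
    ("QUOTE", "PRICE", "string"),
]

print(name_to_type_table(dtd_like))
```

A declaration list that passes this check needs only the resulting
table to type any element; one that fails needs the context as well.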


3.1 Context-independent types
-----------------------------

To make types context-independent, all that is needed is to change
Schema to only allow global element declarations.

Advantages:

* Context-independent structuring is simple, and has a long history
  of use in the SGML community.  Many users of SGML are gobsmacked
  over Schema's introduction of so much extra complexity to support
  features that seem to them to be positively counterproductive.

* In Schema, subclasses can be defined only for global elements (as
  explained in Section 1.3.3.2.3 of Schema Structures).  In the
  Software Engineering community, there is a name for features that
  look attractive (like local element declarations) but inhibit the
  use of more powerful structuring techniques (like subclassing).
  They are called "bad".  The context-independent approach restores
  compatibility with subclassing.

* Context-independent schemas are easy to parse, using top-down
  deterministic tree automata.  Further, parsing can be incremental
  -- it is not necessary to read the entire tree into memory, and the
  space required is proportional to the depth of the tree, not its
  size.

* All the current complexity of associating types with elements in
  the infoset becomes unnecessary.  All one needs is a table mapping
  element names to types.  The simpler system will make it easier for
  other processors to exploit types.


3.2 Context-dependent types
---------------------------

To make types fully context-dependent, Schema should (at least)
remove the restriction that all sibling elements with the same name
should have the same type.

Advantages:

* The union of any two schemas is a schema, which facilitates the
  manipulation of multiple documents.  With context-independent
  types, one must use namespaces.

* Context-dependent types are more expressive.  In particular, they
  can be helpful for importing other data representations into XML.
  For instance, one might have two relational tables, A and B, one
  where ID is an integer, and one where ID is a string.  With
  context-dependent types, these are easily described in a single
  schema, since an ID element inside an A element may have a
  different type than an ID element inside a B element.  With
  context-independent types, one must either use renaming, or put the
  two tables in different namespaces.

* The type system may give much more precise information for queries
  or other applications.

Disadvantages:

* Parsing is more complex.  Parsing may be achieved by either
  bottom-up deterministic, bottom-up non-deterministic, or top-down
  non-deterministic automata; the three are equivalent.  Incremental
  parsing is not always possible, and in the worst case the space
  required is proportional to the size of the whole tree.
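
The relational example in the Advantages list above (an integer ID
inside A, a string ID inside B) can be sketched in Python.  This is
an illustration only, under a strong simplifying assumption: the
context of an element is taken to be just the name of its parent,
which is far less general than regular tree languages.  The CHECKS
table, the DB wrapper element, and the function names are
hypothetical.

```python
import xml.etree.ElementTree as ET

def is_integer(text):
    """True if text parses as an integer."""
    try:
        int(text)
        return True
    except ValueError:
        return False

# Assumption: types are keyed by (parent name, element name), so the
# same element name may carry a different type in a different context.
CHECKS = {
    ("A", "ID"): is_integer,          # ID inside A is an integer
    ("B", "ID"): lambda text: True,   # ID inside B is any string
}

def invalid_leaves(root):
    """Return (parent, name, text) for each declared child that fails its check."""
    bad = []
    for parent in root.iter():
        for child in parent:
            check = CHECKS.get((parent.tag, child.tag))
            if check is not None and not check(child.text or ""):
                bad.append((parent.tag, child.tag, child.text))
    return bad

doc = ET.fromstring(
    "<DB>"
    "<A><ID>17</ID></A>"
    "<B><ID>x-17</ID></B>"
    "<A><ID>oops</ID></A>"
    "</DB>"
)
print(invalid_leaves(doc))  # [('A', 'ID', 'oops')]
```

With context-independent types the table would be keyed by element
name alone, so the two ID declarations would collide unless one were
renamed or moved to another namespace.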
Received on Friday, 12 May 2000 19:10:03 UTC