- From: Rick Jelliffe <ricko@allette.com.au>
- Date: Sat, 30 Mar 2002 17:07:43 +1100
- To: <www-dom@w3.org>
- Cc: <xml-dev@lists.xml.org>
This post deals with two related issues:

  A) Abstract Schemas
  B) Information Item declarations

and it relates to two non-W3C technologies: Schematron and Topologi's <informationItems> schema. It then gives the suggestions which I think follow, in

  C) Practical Suggestions for DOM AS

A) Abstract Schemas
-------------------------

The DOM AS draft does not define an abstract schema. It defines a minimal grammar. An abstract schema would have to, in plain language, abstract the common features of all schema languages and paradigms in some way. So the name is quite misleading. I am sure the Schema WG is aware of this; I hope they will look at it again.

An abstract schema language would have to provide all of

  1) a context traversal policy (e.g. traverse the document in document order)
  2) an abstract context selection mechanism (e.g. select each element, or select the element but use the form attribute value instead of the name if Architectural Forms are being used)
  3) a context-sensitive validator state function (e.g. grammar-based validators traverse through a content model, so that x in one context has different followers than in another)
  4) a validation-rule traversal policy (e.g. validate attributes early, elements on exit)
  5) an abstract validation mechanism (e.g.
     children and attributes for grammars)
  6) an error-handling policy
  7) a way to create emergent properties for subsequent passes

We can use these seven things to categorize various schema languages abstractly.

Schematron is multiple invocations of (for each active pattern)

  1) any traversal policy
  2) an XPath
  3) no state
  4) apply assertions in any order
  5) an XML expression
  6) implementation specific, but node-based invalidation or branch invalidation is OK
  7) N/A

DTDs are

  1) Document order
  2) Select current node
  3) grammar state (plus inclusion context in the case of SGML)
  4) not defined
  5) children content model; for attributes, check tokenizing and ID uniqueness
  6) fail
  7) extract IDs and IDREFs for IDREF checking

We can then say that the IDREF checking is a subsequent kind of schema.

XML Schemas is something like

  1) Document order
  2) Select current node
  3) grammar state, including local elements
  4) validate laxly, etc.
  5) complex and simple content, children and attributes, and uniqueness
  6) fail with particular reports
  7) extract context for Key and Keyref checking

It seems that the DOM AS mechanism abstracts away 1) and 2). Because it does not provide 3), an element can only be asked "are your contents valid?" but not "are you valid?"

B) Information Item Declarations
----------------------------------------

The AS mixes two things:

  1) declarations for document integrity
  2) constraints for validation.

I believe it would be better for these to be treated distinctly.

In Topologi's editor, we provide a file which gives basic declarations for sets of information items. This file can be sent in an XAR application archive. Here is a reduced version.

<!--
    A DTD for declaring sets of information item names.
    2002 (C) Topologi, Pty, Ltd
    Rick Jelliffe, ricko@topologi.com

    The top-level element is informationItems.
-->

<!ELEMENT informationItems
    ( elementSets?, attributeSets?, entitySets?,
      processingSets?, commentSets?, notationSets) >

<!ELEMENT elementSets    (elementSet+) >
<!ELEMENT attributeSets  (attributeSet+) >
<!ELEMENT entitySets     (entitySet+) >
<!ELEMENT processingSets (processingSet+) >
<!ELEMENT commentSets    (commentSet+) >
<!ELEMENT notationSets   (notationSet+) >

<!ELEMENT elementSet    (element+) >
<!ELEMENT attributeSet  (attribute+) >
<!ELEMENT entitySet     (entity+) >
<!ELEMENT processingSet (pi+) >
<!ELEMENT commentSet    (comment+) >
<!ELEMENT notationSet   (notation+) >

<!ATTLIST elementSet
    name    NMTOKEN #REQUIRED
    prefix  NMTOKEN #IMPLIED
    sysid   CDATA   #IMPLIED
    pubid   CDATA   #IMPLIED
    help    CDATA   #IMPLIED >

<!ATTLIST attributeSet
    name    NMTOKEN #REQUIRED
    prefix  NMTOKEN #IMPLIED
    sysid   CDATA   #IMPLIED
    pubid   CDATA   #IMPLIED
    help    CDATA   #IMPLIED >

<!ATTLIST entitySet
    name    NMTOKEN #REQUIRED
    prefix  NMTOKEN #IMPLIED
    sysid   CDATA   #IMPLIED
    pubid   CDATA   #IMPLIED
    help    CDATA   #IMPLIED >

<!ATTLIST processingSet
    name    NMTOKEN #REQUIRED
    prefix  NMTOKEN #IMPLIED
    sysid   CDATA   #IMPLIED
    pubid   CDATA   #IMPLIED
    help    CDATA   #IMPLIED >

<!ATTLIST commentSet
    name    NMTOKEN #REQUIRED
    prefix  NMTOKEN #IMPLIED
    sysid   CDATA   #IMPLIED
    pubid   CDATA   #IMPLIED
    help    CDATA   #IMPLIED >

<!ATTLIST notationSet
    name    NMTOKEN #REQUIRED
    prefix  NMTOKEN #IMPLIED
    sysid   CDATA   #IMPLIED
    pubid   CDATA   #IMPLIED
    help    CDATA   #IMPLIED >

<!ELEMENT element   ANY >
<!ELEMENT attribute ANY >
<!ELEMENT entity    ANY >
<!ELEMENT pi        ANY >
<!ELEMENT comment   ANY >
<!ELEMENT notation  ANY >

<!ATTLIST element
    name    NMTOKEN #REQUIRED
    status  ( deprecate | unused | neutral | new ) "neutral"
    content ( element | mixed | empty | pcdata | cdata | rcdata | default ) "default"
    help    CDATA   #IMPLIED >

<!ATTLIST attribute
    name    NMTOKEN #REQUIRED
    status  ( deprecate | unused | neutral | new ) "neutral"
    help    CDATA   #IMPLIED >

<!ATTLIST entity
    name    NMTOKEN #REQUIRED
    status  ( deprecate | unused | neutral | new ) "neutral"
    content ( xml | sgml | dtd | ndata | cdata)
            -- cdata means "text", ndata means
"binary" -- sysid CDATA #IMPLIED pubid CDATA #IMPLIED help CDATA #IMPLIED > <!ATTLIST pi name NMTOKEN #REQUIRED status ( deprecate | unused | neutral | new ) "neutral" help CDATA #IMPLIED > <!ATTLIST comment name NMTOKEN #REQUIRED status ( deprecate | unused | neutral | new ) "neutral" help CDATA #IMPLIED > <!ATTLIST notation name NMTOKEN #REQUIRED status ( deprecate | unused | neutral | new ) "neutral" help CDATA #IMPLIED > For example, an elementSet gives all the elements in a namespace. Note that there are no schematic rules here: which attributes belong to which elements, or which data types anything can have. An example of a processingSet might be "the PIs that Arbortext Publisher uses". An example of a commentSet might be "Editor comments". (In the Topologi editor, defining these sets allows validation of PIs and comments, which then allows the documents to be robust enough for friendlier automated tools.) I think it is useful to consider this kind of declaration in the light of, for example, James Clark's advocacy against DTDs. As the HTML and MathML working group has discovered, it is not enough merely to make a schema language, all the rest has to be considered too. In the <informationItem> configuration files, we achieve several goals for Topologi system integrators: 1) We define a namespace (a list of names of elements or attributes) 2) We bring comments and PIs to be first-class information items 3) We relieve the schema language from having to worry about entity declarations 4) We expose notations which can be used for any datatypes that cannot be fitted into the schema language (or for xsi:type) 5) By defining all namespace names, we make "open" schema languages even more useful: Schematron rules do not have to enumerate every possible element, but just concentrate on relationships. 6) The contents of the lowest-level elements contains (undisclosed) instantiation information for elements: default values etc. 

If this <informationItems> system were used by DOM, the Sets would be a hashtable and each Set would be a hashtable, and each item would have a standard interface exposing its name, some help text, and its status to a system.

C) Practical Suggestions for DOM AS
---------------------------------------------

1) The DOM ASModel should be reworked into two separate interfaces:

     ASNamedInformationItems
     ASConstraintSets

2) The ASNamedInformationItems interface should expose sets of sets of declarations. These sets should allow various naming methods as appropriate. The declarations should be minimal and be for elements, attributes, PIs, comments, entities, and notations. The use case should be to expose all the information in a Topologi <informationItems> configuration file, which we would contribute as part of the effort if desired.

3) The ASConstraintSet interface should expose a list of ASConstraints objects. Each ASConstraints object corresponds to a particular schema paradigm. I think there are only three really:

     grammatical constraints,
     datatype constraints,
     path-based constraints

Each ASConstraints can have more than one ASConstraint object. The grammars in the current DOM AS draft are examples of these, but there could be different ones, e.g. RELAX.

Perhaps, in order to cope with RELAX NG, the WG should at least provide a content model of "extension" which allows any element in it, in any order and with any occurrence: this would cope with interleave and minimally validate many other things that might come along.

Cheers
Rick Jelliffe
www.topologi.com
Received on Saturday, 30 March 2002 00:55:51 UTC