- From: Arnold, Curt <Curt.Arnold@hyprotech.com>
- Date: Wed, 8 Dec 1999 10:08:10 -0700
- To: "'www-xml-schema-comments@w3.org'" <www-xml-schema-comments@w3.org>
- Cc: "'xml-dev@ic.ac.uk'" <xml-dev@ic.ac.uk>
A few weeks ago, Tim Berners-Lee strongly suggested that other XML technologies start using XML Schema. I have reviewed several of the other XML technologies and believe that, with some minor enhancements, XML Schema can validate them effectively. I've broken up the necessary modifications into several messages so that each can be considered and reviewed independently; however, I see all of them as necessary, reasonable and easy to implement with minimal additional effort. I would appreciate any comments.

The following note was written after reviewing XSLT but before reviewing SVG. SVG makes much more extensive use of lists, so I believe it adds even more compelling justification for the proposal.

In XSLT, there are numerous uses of space-separated lists, two of which cannot be addressed with the DTD-compatibility NMTOKENS list type. This message identifies them, proposes an additional element for XML Schema Datatypes that would address delimited lists in a minimally disruptive and generally useful manner, and then presents schema fragments for the XSLT elements. I believe this is a compelling (even demanding) argument for inclusion of list support in the initial version of XML Schema.

1. List usage in XSLT:

<xsl:stylesheet
    extension-element-prefixes = tokens
    exclude-result-prefixes = tokens
<xsl:strip-space elements = nameTests
<xsl:preserve-space elements = nameTests
<xsl:element use-attribute-sets = qnames    // qname could be done with NMTOKENS,
<xsl:attribute-set use-attribute-sets = qnames
<xsl:copy use-attribute-sets = qnames
<xsl:output cdata-section-elements = qnames

Only the elements attribute of strip-space and preserve-space could not be done with NMTOKENS, since a nameTest can contain '*' and other non-name characters. However, there would also be value in capturing the fact that the qnames lists hold qualified names and that extension-element-prefixes should hold unqualified names.

Pattern is a list of LocationPathPatterns.
However, since LocationPathPattern is not used separately, there is little value in having a LocationPathPattern datatype and defining Pattern as a list of LocationPathPattern. RelativePathPattern would appear to be a list of StepPatterns; however, since the delimiter used ("/" or "//") is significant, it would not be appropriate to treat it as a generic list.

------------

2. Proposed solution:

a) Add list element to schema (uses char datatype defined later)

<element name="list">
  <archetype>
    <attribute name="minOccurs" datatype="non-negative-integer" default="0"/>
    <attribute name="maxOccurs" datatype="non-negative-integer"/>
    <!-- absence of the separator attribute means no separator appears -->
    <attribute name="separator" datatype="char"/>
    <!-- default value (false) means that items can be separated by only
         the separator (if any); true would be useful for comma-delimited
         lists that have non-significant white space -->
    <attribute name="ignoreExcessWhitespace" datatype="boolean" default="false"/>
  </archetype>
</element>

b) Add to datatype element and dataQual archetype

<element name='datatype'>
  <archetype order='all'>
    <element ref="list"/>
    <element ref='basetype'/>
    ....
  </archetype>
</element>

<element name='datatype'>
  <archetype order='all'>
    <element ref="list"/>
    <element ref='basetype'/>
    ...
  </archetype>
</element>

c) Add a couple of new built-in datatypes (not essential, but generally useful). (These are also replicated in a following comment on additional datatypes.)
<datatype name="char">
  <basetype name="string"/>
  <minLength>1</minLength>
  <maxLength>1</maxLength>
</datatype>

<!-- use of qname could result in namespace expansion in type-aware processors -->
<datatype name="qname">
  <basetype name="nmtoken"/>
</datatype>

<datatype name="ncname">
  <basetype name="nmtoken"/>
  <!-- disallow : character, basetype takes care of assuring nmtoken production -->
  <lexicalRepresentation>[^:]*</lexicalRepresentation>
</datatype>

d) Remove the special narrative about NMTOKENS and IDREFS and redefine them as:

<datatype name="idrefs">
  <basetype name="idref"/>
  <list/>
</datatype>

<datatype name="nmtokens">
  <basetype name="nmtoken"/>
  <list/>
</datatype>

3. Use of list element in XSLT schema

<datatype name="nameTest">
  <basetype name="string"/>
  <literalRepresentation>\*</literalRepresentation>
  <!-- I'm going to make a separate note on multiple literal Reps;
       basically it means that as long as I match one of the
       productions I'm acceptable -->
  <literalRepresentation datatype="nmtoken"/>
</datatype>

<datatype name="nameTests">
  <basetype name="nameTest"/>
  <list minOccurs="1"/>
</datatype>

<element name="strip-space">
  <archetype>
    ....
    <attribute name="elements" datatype="nameTests"/>
    ....
  </archetype>
</element>

<attribute name="use-attribute-sets">
  <datatype name="qname">
    <list minOccurs="1"/>
  </datatype>
</attribute>

4. Processing

The following seems a reasonable processing mechanism for lists (shown with separator="," for clarity):

do
    complete production pattern for basetype
    if ignoreExcessWhitespace is true
        match the following regex: [&#x0A;&#x09;&#x0D; ]*,[&#x0A;&#x09;&#x0D; ]*
    else
        match ,
    end if
loop while there is a match

5. Examples of processing

Example a:

<datatype name="strings">
  <basetype name="string"/>
  <list separator=","/>
</datatype>

<element name="nonsense" datatype="strings"/>

Processing any fragment (including the following):

<nonsense>This, is, only, has, one, item, since, nothing, terminates, the, string, production</nonsense>

will return a one-item list, since nothing terminates the string production.

Example b:

<datatype name="quotedString">
  <basetype name="string"/>
  <lexicalRepresentation>"[^"]*"</lexicalRepresentation>
</datatype>

<datatype name="quotedStrings">
  <basetype name="quotedString"/>
  <list separator=","/>
</datatype>

<element name="nonsense" datatype="quotedStrings"/>

Processing the following fragment will result in two items:

<nonsense>"I can have my separator (,) in here since","nothing had terminated my production"</nonsense>

The comma in parentheses is not processed as an item separator, since it was encountered within the scope of the production pattern for quotedString.

Example c:

<datatype name="floats">
  <basetype name="float"/>
  <list separator=","/>
</datatype>

<element name="nonsense" datatype="floats"/>

Processing the following fragment:

<nonsense>3.1415926, 2.718, 1.414</nonsense>

would result in a validation error, since the space between the first comma and the second number does not match the float production. If the list element had been <list separator="," ignoreExcessWhitespace="true"/>, then it would return 3 items.

<nonsense>3.1415926,,1.414</nonsense>

would also be a validation error, since the null string between the two commas does not match the float production.

6. Accessing lists through a type-aware DOM

I definitely think that trying to define how a type-aware DOM would provide access to list data is outside the scope of the schema work.
However, adding generic lists would not appear to add any new issues to that work, since it would already have to address how to provide access to the compatibility lists NMTOKENS and IDREFS. Their solution to that problem could be as easy as saying that there is no native type support for lists and you can only get the entire string back. However, you will have been assured that the string meets your production requirements.

7. Additional burden on schema validation code

I believe the additional burden on validation authors would be minimal, since the generic list validation code can replace any IDREFS or NMTOKENS validation code. I would appreciate any comments from the Xerces or other schema parser teams on their assessment of the additional development burden.
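
As a rough illustration of the processing loop in section 4, here is a minimal Python sketch run against the examples in section 5. It is not part of the proposal: the function name split_list, its parameters, and the regular expressions standing in for the string, quotedString and float productions are all assumptions made for illustration (the float pattern in particular is deliberately simplified). It only covers lists with an explicit separator and does not check minOccurs or maxOccurs.

```python
import re

# Whitespace characters named in section 4: LF, TAB, CR and space.
WS = r"[\x0A\x09\x0D ]*"

# Hypothetical stand-ins for the basetype productions (assumptions).
STRING = r"(?s).*"             # a string production consumes everything
QUOTED = r'"[^"]*"'            # quotedString from Example b
FLOAT = r"[+-]?\d+(?:\.\d+)?"  # simplified float, for illustration only


def split_list(text, item_pattern, separator=",",
               ignore_excess_whitespace=False):
    """Return the list items in text, or raise ValueError if invalid."""
    if not separator:
        raise ValueError("this sketch requires an explicit separator")
    item_re = re.compile(item_pattern)
    sep = re.escape(separator)
    sep_re = re.compile(WS + sep + WS if ignore_excess_whitespace else sep)

    items = []
    pos = 0
    while True:
        # Complete the production pattern for the basetype.
        item = item_re.match(text, pos)
        if item is None:
            raise ValueError(f"no valid item at offset {pos}")
        items.append(item.group(0))
        pos = item.end()
        # Loop only while a separator follows the item just read.
        sep_match = sep_re.match(text, pos)
        if sep_match is None:
            break
        pos = sep_match.end()
    if pos != len(text):
        raise ValueError(f"unexpected text at offset {pos}")
    return items
```

With these stand-ins, the content of Example a yields a single item, Example b yields two items, and "3.1415926, 2.718, 1.414" from Example c raises ValueError unless ignore_excess_whitespace=True is passed, in which case it yields three items. "3.1415926,,1.414" raises ValueError in either mode.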
Received on Wednesday, 8 December 1999 12:10:41 UTC