Schema validation of XSLT, SVG, XPath : Part 1 Proposal for lists

A few weeks ago, Tim Berners-Lee strongly suggested that other XML
technologies start using XML Schema.  

I have reviewed several of the other XML technologies and believe that, with
some minor enhancements, XML Schema can validate them effectively.  I've
broken up the necessary modifications into several different messages so
that each can be considered and reviewed independently.  However, I see all
of them as necessary, reasonable and easy to implement with minimal
additional effort.  I would appreciate any comments.

The following note was written after reviewing XSLT but before reviewing
SVG.  SVG makes much more extensive use of lists, so I believe it adds
even more compelling justification for the proposal.

The datatype draft explicitly defers addressing compound types to the next
revision of Schema.  However, I believe that lists are so essential to
validating these significant XML technologies, and so generally useful, that
they should be addressed in the initial recommendation.

In XSLT, there are numerous uses of space-separated lists, two of which
cannot be addressed with the DTD compatibility NMTOKENS list type.  This
message identifies them, proposes an additional element for XML Schema
Datatypes that would address delimited lists in a minimally disruptive and
generally useful manner, and then presents schema fragments for the XSLT
elements.

I believe this is a compelling (even demanding) argument for inclusion of
list support in the initial version of XML Schema.


1. List usage in XSLT:

<xsl:stylesheet
   extension-element-prefixes = tokens
   exclude-result-prefixes = tokens

<xsl:strip-space elements = nameTests
<xsl:preserve-space elements = nameTests


<xsl:element use-attribute-sets = qnames     //  qname could be done with NMTOKENS

<xsl:attribute-set use-attribute-sets = qnames

<xsl:copy use-attribute-sets = qnames

<xsl:output cdata-section-elements = qnames

Only the strip-space and preserve-space elements could not be done with
NMTOKENS, since a nameTest can contain '*' and other non-name characters.
However, there would also be value in capturing the fact that the qnames
values are qualified names and that the extension-element-prefixes values
should be unqualified names.
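
To make the distinction concrete, here is a rough Python sketch (not part of
the proposal).  The NMTOKEN pattern is only a simplified approximation of the
XML 1.0 Nmtoken production, and the function name is_nmtokens and the sample
values are made up for illustration.

```python
import re

# Simplified stand-in for the XML 1.0 Nmtoken production (assumption: \w
# approximates the NameChar set; combining characters and extenders are not
# modelled exactly).
NMTOKEN = re.compile(r"[\w.\-:]+")

def is_nmtokens(value: str) -> bool:
    """True if value is a non-empty, whitespace-separated list of name tokens."""
    tokens = value.split()
    return bool(tokens) and all(NMTOKEN.fullmatch(t) for t in tokens)

print(is_nmtokens("title-attrs xhtml:base"))  # True: a qname list fits NMTOKENS
print(is_nmtokens("* doc:para"))              # False: '*' is not a name character
```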

Pattern is a list of LocationPathPatterns.  However, since
LocationPathPattern is not used separately, there is little value in having
a LocationPathPattern datatype and defining Pattern as a list of
LocationPathPattern.

RelativePathPattern would appear to be a list of StepPatterns.  However,
since the delimiter used ("/" or "//") is significant, it would not be
appropriate to treat it as a generic list.
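
A small Python illustration of why the delimiter matters (the sample path is
made up): a generic list would keep only the items, while the "/" versus "//"
distinction lives in the delimiters themselves.

```python
import re

path = "doc//section/title"  # hypothetical RelativePathPattern value

# Treating the value as a generic list keeps only the steps...
steps = re.split(r"//|/", path)

# ...whereas a capturing group also keeps the delimiters, which carry meaning.
steps_and_delimiters = re.split(r"(//|/)", path)

print(steps)                 # ['doc', 'section', 'title']
print(steps_and_delimiters)  # ['doc', '//', 'section', '/', 'title']
```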

                                              
------------

2. Proposed solution:

a) Add a list element to the schema (it uses the char datatype defined later):

<element name="list">
     <archetype>
          <attribute name="minOccurs" datatype="non-negative-integer" default="0"/>
          <attribute name="maxOccurs" datatype="non-negative-integer"/>
          <!--  absence of the separator attribute means no separator appears  -->
          <attribute name="separator" datatype="char"/>
          <!--  default value (true) allows non-significant white space around the
                separator, which is useful for comma-delimited lists;
                false means that items can be separated by only the separator (if any)  -->
          <attribute name="ignoreExcessWhitespace" datatype="boolean" default="true"/>
     </archetype>
</element>

b) Add the list element to the datatype element and the dataQual archetype:

  <element name='datatype'>
     <archetype order='all'>
		<element ref="list"/>
        <element ref='basetype'/>
		....
     </archetype>
  </element>


  <element name='datatype'>
     <archetype order='all'>
	    <element ref="list"/>
        <element ref='basetype'/>
		...
     </archetype>
  </element>

c) Add a couple of new built-in datatypes (not essential, but generally
useful).  (These are also replicated in a following comment on additional
datatypes.)

<datatype name="char">
     <basetype name="string"/>
     <minLength>1</minLength>
     <maxLength>1</maxLength>
</datatype>

<!--  use of qname could result in namespace expansion in type aware processors  -->
<datatype name="qname">
     <basetype name="nmtoken"/>
</datatype>

<datatype name="ncname">
     <basetype name="nmtoken"/>
     <!--  disallow : character, basetype takes care of assuring nmtoken production  -->
     <lexicalRepresentation>[^:]*</lexicalRepresentation>
</datatype>
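
As a rough illustration of what these definitions accept, here is a Python
sketch.  It mirrors the datatypes exactly as written above (so ncname is
"nmtoken without a colon", which is looser than the NCName production in the
Namespaces recommendation), and the NMTOKEN pattern is a simplified
assumption rather than the full Nmtoken production.

```python
import re

# Simplified stand-in for the nmtoken base type (assumption, not the exact
# XML 1.0 character classes).
NMTOKEN = re.compile(r"[\w.\-:]+")

def is_char(value: str) -> bool:
    # char: a string with minLength = maxLength = 1
    return len(value) == 1

def is_ncname(value: str) -> bool:
    # ncname: the nmtoken production, further restricted by [^:]*
    return NMTOKEN.fullmatch(value) is not None and ":" not in value

print(is_char(","), is_char(", "))                    # True False
print(is_ncname("xsl"), is_ncname("xsl:template"))    # True False
```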

d) Remove the special narrative about NMTOKENS and IDREFS and redefine
NMTOKENS and IDREFS as:

<datatype name="idrefs">
	<basetype name="idref"/>
	<list/>
</datatype>

<datatype name="nmtokens">
	<basetype name="nmtoken"/>
	<list/>
</datatype>

3. Use of list element in XSLT schema

<datatype name="nameTest">
	<basetype name="string"/>
	<lexicalRepresentation>\*</lexicalRepresentation>
	<!--  I'm going to make a separate note on multiple lexical representations;
	      basically it means that as long as I match one of the productions
	      I'm acceptable  -->
	<lexicalRepresentation datatype="nmtoken"/>
</datatype>

<datatype name="nameTests">
	<basetype name="nameTest"/>
	<list minOccurs="1"/>
</datatype>

<element name="strip-space">
	<archetype>
		....
		<attribute name="elements" datatype="nameTests"/>
		....
	</archetype>
</element>

<attribute name="use-attribute-sets">
	<datatype name="qname">
		<list minOccurs="1"/>
	</datatype>
</attribute>
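
The nameTests datatype above could be checked along these lines.  This is a
Python sketch under stated assumptions: the helper names are invented, the
NMTOKEN pattern is a simplified approximation, and "multiple representations"
is read as "accept a token if any one of the productions matches it".

```python
import re

NMTOKEN = re.compile(r"[\w.\-:]+")  # simplified stand-in for nmtoken

# A nameTest is acceptable if it matches any one of these productions.
NAME_TEST_PRODUCTIONS = [re.compile(r"\*"), NMTOKEN]

def is_name_test(token: str) -> bool:
    return any(p.fullmatch(token) for p in NAME_TEST_PRODUCTIONS)

def is_name_tests(value: str, min_occurs: int = 1) -> bool:
    """Whitespace-separated list of nameTest items with at least min_occurs entries."""
    tokens = value.split()
    return len(tokens) >= min_occurs and all(is_name_test(t) for t in tokens)

print(is_name_tests("* doc:para"))  # True
print(is_name_tests(""))            # False: minOccurs="1" is not met
print(is_name_tests("doc:*"))       # False under the datatype as written
```

The last case suggests that the prefix:* form of an XPath NameTest would need
one more production in the nameTest datatype.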

4. Processing

The following seems a reasonable processing mechanism for a list (shown with
separator="," for clarity):

do
   complete production pattern for basetype
   if ignoreExcessWhitespace is true
        match the following regex [&#x0A;&#x09;&#x0D; ]*,[&#x0A;&#x09;&#x0D; ]*
   else
        match ,
   end if
loop while there is a match
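
The loop above can be sketched in Python as follows.  This is only an
illustration under several assumptions of mine: the item production is
supplied as a regular expression, a missing separator is treated as a run of
white space, leftover input after the last item is reported as an error, and
the names parse_list and FLOAT (a simplified float pattern) are invented here.

```python
import re

WS = r"[\n\t\r ]"  # the white space characters used in the pseudocode
FLOAT = r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?"  # simplified float

def parse_list(text, item_pattern, separator=None, ignore_excess_whitespace=True):
    """Split text into items by alternately matching an item and a separator."""
    item_re = re.compile(item_pattern)
    if separator is None:
        sep_re = re.compile(WS + "+")  # assumption: white space separates items
    elif ignore_excess_whitespace:
        sep_re = re.compile(WS + "*" + re.escape(separator) + WS + "*")
    else:
        sep_re = re.compile(re.escape(separator))

    items, pos = [], 0
    while True:
        item = item_re.match(text, pos)  # complete the item production
        if item is None:
            raise ValueError(f"no valid item at offset {pos}")
        items.append(item.group())
        pos = item.end()
        sep = sep_re.match(text, pos)    # loop only while a separator matches
        if sep is None:
            break
        pos = sep.end()
    if pos != len(text):
        raise ValueError(f"unexpected content at offset {pos}")
    return items

print(parse_list("3.1415926, 2.718, 1.414", FLOAT, ","))
# ['3.1415926', '2.718', '1.414']
```

Because the item production is matched before the separator is looked for, a
separator character that falls inside an item (for example inside a quoted
string) is never treated as a delimiter.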

5. Examples of processing

Example a:

<datatype name="strings">
	<basetype name="string"/>
	<list separator=","/>
</datatype>

<element name="nonsense" datatype="strings"/>

Processing any fragment (including the following):

<nonsense>This, is, only, has, one, item, since, nothing, terminates, the,
string, production</nonsense>

will return a one-item list, since nothing terminates the string production.

Example b:

<datatype name="quotedString">
	<basetype name="string"/>
	<lexicalRepresentation>"[^"]*"</lexicalRepresentation>
</datatype>

<datatype name="quotedStrings">
	<basetype name="quotedString"/>
	<list separator=","/>
</datatype>

<element name="nonsense" datatype="quotedStrings"/>

Processing the following fragment will result in two items

<nonsense>"I can have my separator (,) in here since","nothing had
terminated my production"</nonsense>

The comma in parentheses is not processed as an item separator, since it was
encountered within the scope of the production pattern for quotedString.

Example c:

<datatype name="floats">
	<basetype name="float"/>
	<list separator="," ignoreExcessWhitespace="false"/>
</datatype>

<element name="nonsense" datatype="floats"/>

Processing the following fragment:

<nonsense>3.1415926, 2.718, 1.414</nonsense>

would result in a validation error, since the space between the first comma
and the second number does not match the float production.  If the list
element had been <list separator=","/>, then it would return 3 items.

<nonsense>3.1415926,,1.414</nonsense>

would also be a validation error, since the empty string between the two
commas does not match the float production.

6. Accessing lists through a type-aware DOM

I definitely think that defining how a type-aware DOM would provide access
to list data is outside the scope of the schema work.  However, adding
generic lists would not appear to add any new issues to that work, since it
would already have to address how to provide access to the compatibility
lists NMTOKENS and IDREFS.  Their solution to that problem could be as easy
as saying that there is no native type support for lists and you can only
get the entire string back.  However, you will have been assured that the
string meets your production requirements.


7. Additional burden on schema validation code

I believe the additional burden on validation authors would be minimal, since
the generic list validation code can replace any IDREFS or NMTOKENS
validation code.  I would appreciate any comments from the Xerces team or
other schema parser teams on their assessment of the additional development
burden.

Received on Wednesday, 8 December 1999 12:20:14 UTC