XHTML Modularization 1.1: Lazy datatype patterns in XML Schema

Dear HTML editors,

One of the advantages of XML Schema over DTDs is that attribute values can be validated against regular expression
patterns.

1) In the current working draft specification for "XHTML Modularization 1.1", chapter 4.3 on "Attribute Types", the MultiLengths
datatype is defined as "A comma separated list of items of type MultiLength"
[http://www.w3.org/TR/2006/WD-xhtml-modularization-20060705/abstraction.html#dt_MultiLengths].

While the "MultiLength" type (without the 's') is carefully defined in the "XML Schema datatypes module for XHTML"
[http://www.w3.org/TR/2006/WD-xhtml-modularization-20060705/SCHEMA/xhtml-datatypes-1.xsd],
the "MultiLengths" type is still defined only as a plain string, so this lazy validation accepts almost anything
and does not check the constraints required by the specification.

Lines 120-123 of xhtml-datatypes-1.xsd:

 <!-- comma-separated list of MultiLength -->
 <xs:simpleType name="MultiLengths">
   <xs:restriction base="xs:string"/>
 </xs:simpleType>

A more accurate pattern has already been proposed; see [http://lists.w3.org/Archives/Public/www-html/2006Jun/0033.html]
and
[http://lists.w3.org/Archives/Public/www-html/2006Jun/0031.html]. The proposed improvement is reproduced below:

 <xs:simpleType name="MultiLengths">
   <xs:annotation>
     <xs:documentation>
       comma-separated list of MultiLength
     </xs:documentation>
   </xs:annotation>
   <xs:restriction base="xs:string">
     <xs:pattern
      value="([+-]?(\d+|\d+(\.\d+)?%)|([1-9]\d*)*\*)(,\s*([+-]?(\d+|\d+(\.\d+)?%)|([1-9]\d*)*\*))*"/>
   </xs:restriction>
 </xs:simpleType>


2) Similarly, the datatypes "ContentType" ("A media type, as per [RFC2045]") and "ContentTypes" ("A comma-separated list of
media types, as per [RFC2045]") are also defined as plain strings, although some basic validation could be done. RFC 2045 does
provide a BNF grammar.

The patterns should of course not list all the possible IANA types [http://www.iana.org/assignments/media-types/], but should at
least check for minimal syntactic integrity.

A quickly written proposal (still to be tested), intended only as an illustration:

ContentType: "([xX]-[a-zA-Z0-9_.+-]+|[a-zA-Z]+)/[a-zA-Z0-9_.+-]+"

ContentTypes: "(([xX]-[a-zA-Z0-9_.+-]+|[a-zA-Z]+)/[a-zA-Z0-9_.+-]+)(,\s*(([xX]-[a-zA-Z0-9_.+-]+|[a-zA-Z]+)/[a-zA-Z0-9_.+-]+))*"
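
Wrapped in the same form as the MultiLengths definition above, these two patterns might look as follows in
xhtml-datatypes-1.xsd. This is only a sketch: the patterns are the untested ones given just above, copied verbatim, and they
are not claimed to cover the full RFC 2045 grammar (media type parameters, for instance, are not handled).

 <!-- Sketch only: illustrative, untested patterns -->
 <xs:simpleType name="ContentType">
   <xs:annotation>
     <xs:documentation>
       media type, as per [RFC2045]
     </xs:documentation>
   </xs:annotation>
   <xs:restriction base="xs:string">
     <xs:pattern
      value="([xX]-[a-zA-Z0-9_.+-]+|[a-zA-Z]+)/[a-zA-Z0-9_.+-]+"/>
   </xs:restriction>
 </xs:simpleType>

 <xs:simpleType name="ContentTypes">
   <xs:annotation>
     <xs:documentation>
       comma-separated list of media types, as per [RFC2045]
     </xs:documentation>
   </xs:annotation>
   <xs:restriction base="xs:string">
     <xs:pattern
      value="(([xX]-[a-zA-Z0-9_.+-]+|[a-zA-Z]+)/[a-zA-Z0-9_.+-]+)(,\s*(([xX]-[a-zA-Z0-9_.+-]+|[a-zA-Z]+)/[a-zA-Z0-9_.+-]+))*"/>
   </xs:restriction>
 </xs:simpleType>

With such definitions, a value like "text/html" would be accepted, while a value with no "/" separator would be rejected.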


3) Similarly again, the datatype "Charset" ("A character encoding, as per [RFC2045]") should be stricter than a plain
string.
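
A minimal sketch of what such a restriction could look like is given below. The pattern is an assumption made here for
illustration only, untested and deliberately conservative: it accepts names made of ASCII letters, digits and a few
punctuation characters (enough for names such as "UTF-8" or "ISO-8859-1"), and it is narrower than what RFC 2045 allows.

 <!-- Sketch only: hypothetical, conservative pattern, not taken from RFC 2045 -->
 <xs:simpleType name="Charset">
   <xs:annotation>
     <xs:documentation>
       character encoding, as per [RFC2045]
     </xs:documentation>
   </xs:annotation>
   <xs:restriction base="xs:string">
     <xs:pattern
      value="[A-Za-z0-9][A-Za-z0-9._:+-]*"/>
   </xs:restriction>
 </xs:simpleType>

Even this loose form would reject an empty value or a value containing whitespace, which a plain string does not.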


Cordially,
Alexandre
http://alexandre.alapetite.net

Received on Thursday, 6 July 2006 13:26:26 UTC