RE: XHTML Modularization 1.1: Lazy datatype patterns in XML Schema from Alexandre Alapetite on 2006-07-09 (www-html-editor@w3.org from July to September 2006)

From: Alexandre Alapetite <alexandre@alapetite.net>
Date: Sun, 9 Jul 2006 18:03:09 +0200
To: <www-html-editor@w3.org>
Message-ID: <000101c6a371$2c6a1720$f9043f50@athlon1100>
Dear HTML editors,
Following my previous e-mail
[http://lists.w3.org/Archives/Public/www-html-editor/2006JulSep/0004.html], which asked for better patterns in the XML Schemas for
XHTML Modularization datatypes (some types are currently defined as simple strings), here are some concrete proposals for
[http://www.w3.org/TR/2006/WD-xhtml-modularization-20060705/SCHEMA/xhtml-datatypes-1.xsd].

"xhtml-datatypes-1.xsd" with my suggestions is also attached to this e-mail.

In addition to making the XML Schemas for XHTML Modularization more useful by detecting more errors, these patterns will also make
the recommendation less ambiguous and easier to implement.


Charset: "A character encoding, as per [RFC2045]"
--------
See [http://www.iana.org/assignments/character-sets].

Matches, for instance, "ISO-8859-1" but not "ISO-8859-1 UTF-8".

 <xs:pattern value="[a-zA-Z0-9.:_-]+"/>
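
The pattern can be tried outside a schema validator with a few lines of Python. This is only a minimal sketch, not part of the proposal: the helper name `is_charset` is hypothetical, and `re.fullmatch` is used on the assumption that it reproduces the whole-value matching of `xs:pattern` for a simple character class like this one.

```python
import re

# Proposed Charset pattern, copied from the xs:pattern above.
CHARSET_PATTERN = re.compile(r"[a-zA-Z0-9.:_-]+")

def is_charset(value: str) -> bool:
    """Hypothetical helper: True if the whole value matches the proposed pattern."""
    # An xs:pattern must match the entire value, hence fullmatch rather than search.
    return CHARSET_PATTERN.fullmatch(value) is not None

for candidate in ["ISO-8859-1", "ISO-8859-1 UTF-8"]:
    print(repr(candidate), is_charset(candidate))
```

The space in the second candidate is outside the character class, so a list of two encodings is rejected as a whole.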

Note: As far as I can see, there is no definition of "character encoding" in RFC 2045, which uses the term "character set"
instead (see chapter 2.2). Furthermore, RFC 2045 is not very helpful in defining what a character set is, and says nothing
about the format of the Charset datatype. The XHTML Modularization recommendation should be updated to be more precise. Providing
an official regular expression would do the job.


MultiLengths: "A comma separated list of items of type MultiLength"
-------------
See [http://www.w3.org/TR/1999/REC-html401-19991224/present/frames.html#h-16.2.1.1].

Match for instance "50%, 50%", "30%,400,*,2*" but not "50" and not "50% 50%".

 <xs:pattern value="([+-]?(\d+|\d+(\.\d+)?%)|([1-9]\d*)*\*)(,\s*([+-]?(\d+|\d+(\.\d+)?%)|([1-9]\d*)*\*))*"/>
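
The examples above can be checked mechanically. The following Python sketch is an illustration under stated assumptions rather than a reference implementation: the name `is_multilengths` is hypothetical, and Python's `\d` and `\s` are assumed to be close enough to their XML Schema counterparts for ASCII input.

```python
import re

# Proposed MultiLengths pattern, copied verbatim from the xs:pattern above.
MULTILENGTHS = re.compile(
    r"([+-]?(\d+|\d+(\.\d+)?%)|([1-9]\d*)*\*)"
    r"(,\s*([+-]?(\d+|\d+(\.\d+)?%)|([1-9]\d*)*\*))*"
)

def is_multilengths(value: str) -> bool:
    """Hypothetical helper: True if the whole value matches the proposed pattern."""
    return MULTILENGTHS.fullmatch(value) is not None

for candidate in ["50%, 50%", "30%,400,*,2*", "50% 50%", "50"]:
    print(repr(candidate), is_multilengths(candidate))
```

Run this way, the two comma-separated lists are accepted and "50% 50%" is rejected. A lone "50" is also accepted by the pattern as written, which differs from the stated expectation: the trailing group may repeat zero times. If a list of at least two items is really intended, replacing the final `*` with `+` would be one way to express it.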

This proposal has already been discussed in
[http://lists.w3.org/Archives/Public/www-html/2006Jun/0031.html] and
[http://lists.w3.org/Archives/Public/www-html/2006Jun/0033.html].


ContentTypes: "A comma-separated list of media types, as per [RFC2045]"
-------------
See ContentType below. If <ContentType> is the regular expression for ContentType, then

<ContentTypes> = <ContentType>(,\s*<ContentType>)*
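
The composition rule can be sketched in Python by building the list pattern from a single-item pattern. This is an illustration only: it plugs in the shorter ContentType pattern given later in this message, the names `CONTENT_TYPE`, `CONTENT_TYPES` and `is_content_types` are hypothetical, and the extra parentheses around each item are an added precaution in case the item pattern ever contains a top-level alternation.

```python
import re

# Shorter ContentType pattern from this message, with &quot; written as ".
CONTENT_TYPE = r'[^/ ;,=]+/[^/ ;,=]+(;\s*[^/ ;,=]+=([^/ ;,=]+|"[^"]+"))*'

# <ContentTypes> = <ContentType>(,\s*<ContentType>)*
CONTENT_TYPES = re.compile(
    "(" + CONTENT_TYPE + ")" + r"(,\s*(" + CONTENT_TYPE + "))*"
)

def is_content_types(value: str) -> bool:
    """Hypothetical helper: True if the whole value is a comma-separated list."""
    return CONTENT_TYPES.fullmatch(value) is not None

for candidate in ["text/html",
                  "text/html, application/xhtml+xml",
                  "text/html application/xhtml+xml"]:
    print(repr(candidate), is_content_types(candidate))
```

Because the item pattern excludes the comma from unquoted tokens, the comma can serve as an unambiguous list separator.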


ContentType: "A media type, as per [RFC2045]"
------------
See RFC 2045 [http://www.ietf.org/rfc/rfc2045.txt], RFC 2822 [http://www.ietf.org/rfc/rfc2822.txt] and
[http://www.iana.org/assignments/media-types/].

Matches, for instance, "text/plain", "text/plain; charset=us-ascii" and 'text/plain; charset="us-ascii"', but not "text" or
"text/plain; charset=us-ascii (Plain text)".

Note: It is not very obvious, but I have assumed that parameters such as "; charset=us-ascii" are allowed for the ContentType
datatype, but not comments such as " (Plain text)". Indeed, RFC 2045 refers to RFC 822, which says in chapter 3.4.3 that
"comments must NOT be included in other cases, such as during protocol exchanges with mail servers". If people think
comments should be allowed, the following regular expression can easily be modified. In any case, the XHTML Modularization
recommendation should be clearer about what is allowed, since it is not obvious, and referring to RFC 2045 is imho not precise
enough. Giving an official regular expression will remove this ambiguity.

- Since the regular expression for ContentType can be quite long, I first propose a shorter, less strict version:

 <xs:pattern value="[^/ ;,=]+/[^/ ;,=]+(;\s*[^/ ;,=]+=([^/ ;,=]+|&quot;[^&quot;]+&quot;))*"/>

Explanations:

 [^/ ;,=]+      # Content type
 /              # Separator type/subtype
 [^/ ;,=]+      # Content subtype
 (;             # Delimiter for optional parameter
  \s*           ## Optional space
  [^/ ;,=]+     ## Name of the parameter
  =             ## Separator parameter=value
  (             ## Value of the parameter
   [^/ ;,=]+    ### The value can be a token
   |            ### or
   "[^"]+"      ### a quoted-string
                ### (The double quote " is written &quot;
                ### in an XML attribute)
  )             ##
 )*             # There can be 0 or more parameters
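
The commented layout above carries over almost unchanged to Python's verbose regular expression mode, which makes it easy to test the shorter pattern against the examples given earlier. This is a sketch under assumptions: the name `matches_short` is hypothetical, and `re.fullmatch` stands in for the whole-value matching of `xs:pattern`. In verbose mode, whitespace inside a character class is still significant, so the space in `[^/ ;,=]` keeps its meaning.

```python
import re

# Shorter ContentType pattern, laid out like the explanation above.
SHORT_CONTENT_TYPE = re.compile(r'''
    [^/ ;,=]+        # Content type
    /                # Separator type/subtype
    [^/ ;,=]+        # Content subtype
    (;               # Delimiter for optional parameter
      \s*            # Optional space
      [^/ ;,=]+      # Name of the parameter
      =              # Separator parameter=value
      (              # Value of the parameter
        [^/ ;,=]+    # The value can be a token
        |            # or
        "[^"]+"      # a quoted-string
      )
    )*               # There can be 0 or more parameters
''', re.VERBOSE)

def matches_short(value: str) -> bool:
    """Hypothetical helper: True if the whole value matches the shorter pattern."""
    return SHORT_CONTENT_TYPE.fullmatch(value) is not None

for candidate in ["text/plain",
                  "text/plain; charset=us-ascii",
                  'text/plain; charset="us-ascii"',
                  "text",
                  "text/plain; charset=us-ascii (Plain text)"]:
    print(repr(candidate), matches_short(candidate))
```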


- Here is now a more precise version for ContentType:

 <xs:pattern value="([xX][-.][!#$%&amp;'*+-.0-9A-Z\^_`a-z{|}~]+|[a-zA-Z]{4,})/([xX][-.][!#$%&amp;'*+-.0-9A-Z\^_`a-z{|}~]+|[a-zA-Z0-9._+-]+)(;\s*[!#$%&amp;'*+-.0-9A-Z\^_`a-z{|}~]+=([!#$%&amp;'*+-.0-9A-Z\^_`a-z{|}~]+|&quot;[^&quot;]+&quot;))*"/>
 
Explanations:

According to RFC 2045, a token is [!#$%&'*+-.0-9A-Z\^_`a-z{|}~]+

 (                                         # The content type can be
  [xX][-.][!#$%&'*+-.0-9A-Z\^_`a-z{|}~]+   ## a non-standard name starting with
                                           ## X- or X. followed by a token
                                           ## (The ampersand & is written
                                           ## &amp; in XML)
  |                                        ## or
  [a-zA-Z]{4,}                             ## one of the IANA content-types
                                           ## such as application, audio, image,
                                           ## message, model, multipart, text, video
 )                                         #
 /                                         # Separator type/subtype
 (                                         # The content subtype can be
  [xX][-.][!#$%&'*+-.0-9A-Z\^_`a-z{|}~]+   ## a non-standard name starting with
                                           ## X- or X. followed by a token
  |                                        ## or
  [a-zA-Z0-9._+-]+                         ## one of the IANA content subtypes
                                           ## such as html, xhtml+xml
 )                                         #
 (;                                        # Delimiter for optional parameter
  \s*                                      ## Optional space
  [!#$%&'*+-.0-9A-Z\^_`a-z{|}~]+           ## The name of the parameter is a token
  =                                        ## Separator parameter=value
  (                                        ## Value of the parameter
   [!#$%&'*+-.0-9A-Z\^_`a-z{|}~]+          ### The value can be a token
   |                                       ### or
   "[^"]+"                                 ### a quoted-string
                                           ### (The quotation mark " is written &quot;
                                           ### in an XML attribute)
  )                                        ##
 )*                                        # There can be 0 or more parameters
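
The more precise pattern can be assembled from the token class and exercised in the same way. This Python sketch is illustrative only: the names `TOKEN`, `PRECISE_CONTENT_TYPE` and `matches_precise` are hypothetical, the entities `&amp;` and `&quot;` are written as plain characters, and `re.fullmatch` is assumed to mirror the whole-value matching of `xs:pattern`.

```python
import re

# RFC 2045 token class, exactly as written in this message.
TOKEN = r"[!#$%&'*+-.0-9A-Z\^_`a-z{|}~]+"

PRECISE_CONTENT_TYPE = re.compile("".join([
    r"([xX][-.]", TOKEN, r"|[a-zA-Z]{4,})",        # type: X- / X. name or IANA type
    r"/",                                          # separator type/subtype
    r"([xX][-.]", TOKEN, r"|[a-zA-Z0-9._+-]+)",    # subtype
    r"(;\s*", TOKEN, r"=(", TOKEN, r'|"[^"]+"))*', # optional parameters
]))

def matches_precise(value: str) -> bool:
    """Hypothetical helper: True if the whole value matches the precise pattern."""
    return PRECISE_CONTENT_TYPE.fullmatch(value) is not None

for candidate in ["text/plain",
                  'text/plain; charset="us-ascii"',
                  "application/xhtml+xml",
                  "x-custom/x-thing",
                  "text/plain; charset=us-ascii (Plain text)",
                  "text/plain; a=b,c"]:
    print(repr(candidate), matches_precise(candidate))
```

One detail becomes visible with the last candidate: inside the token class, `+-.` reads as a range from `+` to `.`, which also covers the comma, so "a=b,c" is accepted. RFC 2045 lists the comma among the tspecials, so this may be unintended; writing the hyphen at the end of the class would avoid the range.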


These proposed regular expressions of course require more testing against the appropriate RFCs, W3C recommendations and test
cases.

I think the IANA [http://www.iana.org] should be invited to review those XHTML Modularization datatype definitions.

Cordially,
Alexandre
http://alexandre.alapetite.net
Attachments

application/octet-stream attachment: xhtml-datatypes-1.xsd
Received on Sunday, 9 July 2006 16:03:13 UTC