Request for comments on the definition of some data types in XHTML Modularisation

Dear HTML editors,
Thank you for your quick response to
[http://lists.w3.org/Archives/Public/www-html-editor/2007JanMar/0026.html]
regarding errors in XHTML 1.1 schemas.

However, I am worried that you have decided not to update the lazily  
defined data types of XHTML Modularisation.

Since I believe this is an important question, I am posting to
www-html@w3.org to request further comments from people interested in HTML.

== I would be very pleased to hear other people's views on this ==

Please read below.

--- From W3C's HTML Working Group Issue Tracking System ---
[http://htmlwg.mn.aptest.com/cgi-bin/voyager-issues/Modularization-Schemas?id=9715]
> Thank you for your comments.  Most of these issues have been corrected
> internally and we will publish the corrections as soon as possible.
>
> With regard to the lazy type issues, the working group was concerned
> that making the type checking overly constrained had the risk of
> incorrectly flagging valid documents as invalid; this is mostly because
> the regular expressions and RFCs involved are so complicated we are not
> confident that all legal cases will be addressed.
See also  
[http://htmlwg.mn.aptest.com/cgi-bin/voyager-issues/Modularization-Schemas?id=9606]

Since an update of XHTML Modularisation is apparently coming soon, I take
this opportunity to argue again for improving, at least minimally, the
definition of some datatypes that are currently defined as plain strings.
An example of the current situation in xhtml-datatypes-1.xsd:

	<!-- comma-separated list of media types, as per [RFC2045] -->
	<xs:simpleType name="ContentTypes">
		<xs:restriction base="xs:string"/>
	</xs:simpleType>

I understand your point about not wanting to be incorrectly restrictive,
but the current situation errs too far in the other direction: it allows
any content at all, however incorrect. The XHTML Modularisation
recommendation is there to specify how things should be, and what we can
expect to find in XHTML documents. The purpose of XML Schema was precisely
to be more detailed, including at the lexical level.

The current situation leaves this task to implementers of e.g. browsers
and validators, which is not good, since it will lead to various
non-standard implementations and therefore to incompatibilities. I accept
your point that the RFCs involved are not that straightforward, which makes
it even more important to clarify them in this W3C recommendation.

Finally, the drafts of W3C recommendations are a unique opportunity to
gather feedback, an opportunity that independent implementers do not have.

To help other people comment, I repeat below the suggested improvements
that I already posted in e.g.
[http://lists.w3.org/Archives/Public/www-html-editor/2006JulSep/0022.html]
(see also attachment xhtml-datatypes-1.xsd).

The only complex expression is for the data type "ContentType", for which
I propose two versions. The first, simple version is based solely on the
definition provided by the W3C. This is the version I would like to see
integrated into xhtml-datatypes-1.xsd, since I believe it makes the
definition considerably more precise without risking the rejection of valid
cases and without depending too much on the RFCs. The full version is much
more detailed and strict, and is based on a careful reading of the RFCs and
W3C documents involved.


Charset: "A character encoding, as per [RFC2045]"
--------
See [http://www.iana.org/assignments/character-sets].

Match for instance "ISO-8859-1" but not "ISO-8859-1 UTF-8".

  <xs:pattern value="[a-zA-Z0-9.:_-]+"/>

Note: As far as I can see, there is no definition of "character encoding"
in RFC 2045, which instead uses the term "character set" (see section 2.2).
Furthermore, RFC 2045 is not very helpful in defining what a character set
is, and says nothing about the format of the Charset datatype. The XHTML
Modularization recommendation should be updated to be more precise.
Providing an official regular expression would do the job.
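
A minimal sketch of how the Charset pattern above could be exercised,
assuming Python's "re" module as a stand-in for an XSD regex engine (the
two differ in some details, so this is only an approximation). The helper
name is_charset is hypothetical and not part of any schema or tool
mentioned here.

```python
import re

# Charset pattern proposed above. XSD patterns are implicitly anchored,
# so re.fullmatch is the closest Python equivalent.
CHARSET = re.compile(r"[a-zA-Z0-9.:_-]+")


def is_charset(value: str) -> bool:
    """Return True if the whole string matches the proposed pattern."""
    return CHARSET.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ["ISO-8859-1", "ISO-8859-1 UTF-8"]:
        print(repr(sample), is_charset(sample))
```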


MultiLengths: "A comma separated list of items of type MultiLength"
-------------
See  
[http://www.w3.org/TR/1999/REC-html401-19991224/present/frames.html#h-16.2.1.1].

Matches, for instance, "50%, 50%" and "30%,400,*,2*" but not "50" and not
"50% 50%".

  <xs:pattern  
value="([+-]?(\d+|\d+(\.\d+)?%)|([1-9]\d*)*\*)(,\s*([+-]?(\d+|\d+(\.\d+)?%)|([1-9]\d*)*\*))*"/>
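
The MultiLengths pattern above can be tried out with a short sketch,
again assuming Python's "re" module approximates the XSD regex dialect
(for example, Python's \s accepts more whitespace characters than XSD's).
The name is_multilengths is hypothetical. As a side observation, under
this reading the pattern also accepts a single item such as "50", because
the trailing group may repeat zero times; whether that is intended would
need confirming.

```python
import re

# One list item: a signed pixel or percentage length, or a relative
# length such as "*" or "2*" (copied from the pattern above).
ITEM = r"([+-]?(\d+|\d+(\.\d+)?%)|([1-9]\d*)*\*)"

# Full pattern: one item followed by zero or more ", item" groups.
MULTILENGTHS = re.compile(ITEM + r"(,\s*" + ITEM + r")*")


def is_multilengths(value: str) -> bool:
    """Return True if the whole string matches the proposed pattern."""
    return MULTILENGTHS.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ["50%, 50%", "30%,400,*,2*", "50% 50%", "50"]:
        print(repr(sample), is_multilengths(sample))
```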

This proposal has already been commented on, e.g. in
[http://lists.w3.org/Archives/Public/www-html/2006Jun/0033.html]. By the
way, the XML Schema for "XHTML 1.0 Frameset" still has a major error due to
MultiLengths
[http://lists.w3.org/Archives/Public/www-html/2006Jun/0031.html].


ContentTypes: "A comma-separated list of media types, as per [RFC2045]"
-------------
See ContentType below. If <ContentType> is the regular expression for
ContentType, then

  <ContentTypes> = <ContentType>(,\s*<ContentType>)*


ContentType: "A media type, as per [RFC2045]"
------------
See RFC 2045 [http://www.ietf.org/rfc/rfc2045.txt], RFC 2822  
[http://www.ietf.org/rfc/rfc2822.txt] and
[http://www.iana.org/assignments/media-types/].

Matches, for instance, "text/plain", "text/plain; charset=us-ascii" and
'text/plain; charset="us-ascii"' but not "text" or
"text/plain; charset=us-ascii (Plain text)".

Note: It is not very obvious, but I have assumed that parameters such as
"; charset=us-ascii" are allowed for the ContentType datatype, but not
comments such as " (Plain text)". Indeed, RFC 2045 refers to RFC 822, which
says in section 3.4.3 that "comments must NOT be included in other cases,
such as during protocol exchanges with mail servers". If people think
comments should be allowed, the following regular expression can easily be
modified. In any case, XHTML Modularization should be clearer about what is
allowed, since it is not obvious, and referring to RFC 2045 is imho not
precise enough. Giving an official regular expression would remove this
ambiguity.

- Short version:

  <xs:pattern value="[^/ ;,=]+/[^/ ;,=]+(;\s*[^/ ;,=]+=([^/ ;,=]+|&quot;([^&quot;\\]|\\\\|\\&quot;)*&quot;))*"/>

Explanations:

  [^/ ;,=]+      # Content type
  /              # Separator type/subtype
  [^/ ;,=]+      # Content subtype
  (;             # Delimiter for optional parameter
   \s*           ## Optional space
   [^/ ;,=]+     ## Name of the parameter
   =             ## Separator parameter=value
   (             ## Value of the parameter
    [^/ ;,=]+    ### The value can be a token
    |            ### or
                 ### a quoted-string
    "            #### quotation mark (written &quot; in an XML attribute)
    (            ####
     [^"\\]      ##### any character but a quotation mark " or a
                 ##### backslash \
     |           ##### or
     \\\\        ##### an escaped backslash \\
     |           ##### or
     \\"         ##### an escaped quotation mark \"
    )*           #### the content of a quoted string can
                 #### be 0 or more characters
    "            #### quotation mark
                 ### end of quoted-string
   )             ##
  )*             # There can be 0 or more parameters
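
A minimal sketch, assuming Python's "re" module as an approximation of
the XSD regex dialect, that exercises the short ContentType pattern and
the ContentTypes list built from it as <ContentType>(,\s*<ContentType>)*.
The function names are hypothetical, and &quot; is written as a plain
quotation mark because no XML escaping is involved here.

```python
import re

# Short ContentType pattern from above: type "/" subtype, followed by
# optional ";name=value" parameters (value is a token or quoted-string).
CONTENT_TYPE = (
    r'[^/ ;,=]+/[^/ ;,=]+'
    r'(;\s*[^/ ;,=]+=([^/ ;,=]+|"([^"\\]|\\\\|\\")*"))*'
)

# ContentTypes: a comma-separated list of ContentType values.
CONTENT_TYPES = CONTENT_TYPE + r"(,\s*" + CONTENT_TYPE + r")*"


def is_content_type(value: str) -> bool:
    """Return True if the whole string matches the short pattern."""
    return re.fullmatch(CONTENT_TYPE, value) is not None


def is_content_types(value: str) -> bool:
    """Return True if the whole string matches the list pattern."""
    return re.fullmatch(CONTENT_TYPES, value) is not None


if __name__ == "__main__":
    for sample in [
        "text/plain",
        'text/plain; charset="us-ascii"',
        "text/plain; charset=us-ascii (Plain text)",
    ]:
        print(repr(sample), is_content_type(sample))
    print(is_content_types("text/html, application/xhtml+xml"))
```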


- Here is the full, more precise version for ContentType:

  <xs:pattern
value="([xX][-.][!#$%&amp;'*+-.0-9A-Z\\^_`a-z{|}~]+|[a-zA-Z]{4,})/([xX][-.][!#$%&amp;'*+-.0-9A-Z\\^_`a-z{|}~]+|[a-zA-Z0-9._+-]+)
(;\s*[!#$%&amp;'*+-.0-9A-Z\\^_`a-z{|}~]+=([!#$%&amp;'*+-.0-9A-Z\\^_`a-z{|}~]+|&quot;([^&quot;\\]|\\\\|\\&quot;)*&quot;))*"/>

Explanations:

According to RFC 2045, a token is [!#$%&'*+-.0-9A-Z\\^_`a-z{|}~]+

  (                                         # The content type can be
   [xX][-.][!#$%&'*+-.0-9A-Z\\^_`a-z{|}~]+  ## a non-standard name starting with
                                            ## X- or X. followed by a token
                                            ## (The ampersand & is written
                                            ## &amp; in XML)
   |                                        ## or
   [a-zA-Z]{4,}                             ## one of the IANA content-types
                                            ## such as application, audio, image,
                                            ## message, model, multipart, text, video
  )                                         #
  /                                         # Separator type/subtype
  (                                         # The content subtype can be
   [xX][-.][!#$%&'*+-.0-9A-Z\\^_`a-z{|}~]+  ## a non-standard name starting with
                                            ## X- or X. followed by a token
   |                                        ## or
   [a-zA-Z0-9._+-]+                         ## one of the IANA content subtypes
                                            ## such as html, xhtml+xml
  )                                         #
  (;                                        # Delimiter for optional parameter
   \s*                                      ## Optional space
   [!#$%&'*+-.0-9A-Z\\^_`a-z{|}~]+          ## The name of the parameter is a token
   =                                        ## Separator parameter=value
   (                                        ## Value of the parameter
    [!#$%&'*+-.0-9A-Z\\^_`a-z{|}~]+         ### The value can be a token
    |                                       ### or
                                            ### a quoted-string
    "                                       #### quotation mark (written &quot; in an XML attribute)
    (                                       ####
     [^"\\]                                 ##### any character...
                                            ##### but a quotation mark " or a backslash \
     |                                      ##### or
     \\\\                                   ##### an escaped backslash \\
     |                                      ##### or
     \\"                                    ##### an escaped quotation mark \"
    )*                                      #### the content of a quoted string can
                                            #### be 0 or more characters
    "                                       #### quotation mark
                                            ### end of quoted-string
   )                                        ##
  )*                                        # There can be 0 or more parameters
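
The full version can be assembled from its parts and tried in the same
way. This is only a sketch, assuming Python's "re" module reads the
character classes as an XSD processor would; the names TOKEN, QUOTED,
matches_full and token_accepts are hypothetical. One tentative
observation from this reading: inside the token class, "+-." acts as a
range that covers the comma, and "\\" adds the backslash, whereas
RFC 2045 lists both "," and "\" among the tspecials that a token must
not contain. Whether the class should be tightened would need checking
against an actual XSD engine.

```python
import re

# Token character class as written above (XML entities decoded).
TOKEN = r"[!#$%&'*+-.0-9A-Z\\^_`a-z{|}~]+"
# Quoted-string: any character except " and \, or an escaped \ or ".
QUOTED = r'"([^"\\]|\\\\|\\")*"'

FULL_CONTENT_TYPE = re.compile("".join([
    r"([xX][-.]", TOKEN, r"|[a-zA-Z]{4,})",             # type
    r"/",                                               # separator
    r"([xX][-.]", TOKEN, r"|[a-zA-Z0-9._+-]+)",         # subtype
    r"(;\s*", TOKEN, "=(", TOKEN, "|", QUOTED, r"))*",  # parameters
]))


def matches_full(value: str) -> bool:
    """Return True if the whole string matches the full pattern."""
    return FULL_CONTENT_TYPE.fullmatch(value) is not None


def token_accepts(text: str) -> bool:
    """Return True if the token class alone accepts the whole string."""
    return re.fullmatch(TOKEN, text) is not None


if __name__ == "__main__":
    for sample in ["text/plain", "application/xhtml+xml", "x-foo/bar", "text"]:
        print(repr(sample), matches_full(sample))
    # Characters that RFC 2045 classifies as tspecials:
    print(token_accepts(","), token_accepts("\\"))
```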


These proposed regular expressions of course require more testing against
the appropriate RFCs, W3C recommendations and test cases.

For people interested in testing, the above definitions are used in my  
XHTML validator
[http://alexandre.alapetite.net/distribution/weblide/].

Cordially,
Alexandre
http://alexandre.alapetite.net

Received on Tuesday, 20 February 2007 15:57:05 UTC