comments on XHTML Modularization 1.1 from XML Schema WG

Dear colleagues:

On behalf of the XML Schema Working Group, I congratulate the
HTML Working Group on your progress with XHTML Modularization.

As described in the comments below, owing to a snafu
the XML Schema WG did not review the Last Call WD of XHTML
Modularization 1.1 last summer.  In the hopes that the maxim
"better late than never" is true in this case, we transmit
to you now our comments on the document.  My apologies for
the snafu.

Our comments are available at any of the URIs

   http://www.w3.org/XML/Group/2007/02/m12n-of-xhtml.xsd-comments
   http://www.w3.org/XML/Group/2007/02/m12n-of-xhtml.xsd-comments.xml
   http://www.w3.org/XML/Group/2007/02/m12n-of-xhtml.xsd-comments.html

A text version is provided below for those who find it more
convenient.

--C. M. Sperberg-McQueen
   on behalf of the W3C XML Schema WG




Notes on

XHTML Modularization 1.1

    Ed. by

C. M. Sperberg-McQueen

Submitted to the HTML Working Group on behalf of the XML Schema Working
Group

27 February 2007

$Id: m12n-of-xhtml.xsd-comments.html,v 1.1 2007/02/27 22:36:18 cmsmcq Exp $
      _________________________________________________________

      * 1. Background
      * 2. Substantive comments
           + 2.1. Charset type
           + 2.2. Color type
           + 2.3. ContentType
           + 2.4. Coords type
           + 2.5. FPI type
           + 2.6. FrameTarget type
           + 2.7. LinkTypes type
           + 2.8. Tightening other types
           + 2.9. Named model groups vs. substitution groups
           + 2.10. Adding attributes
           + 2.11. A missing scenario
      * 3. Editorial comments
           + 3.1. Make the introduction less DTD-specific
           + 3.2. The term PCDATA
           + 3.3. Section 4.3 Attribute Types
           + 3.4. Length type: well done
           + 3.5. Shape type
           + 3.6. White space in the document source
      * 4. Comments half substantive and half editorial
           + 4.1. Testing the schema documents
           + 4.2. Where is the html element?
           + 4.3. Case insensitivity and XML Schema patterns or
             enumerations
      _________________________________________________________

    NOTE:
    This document contains comments on the [31]Last Call Working Draft
    of XHTML™ Modularization 1.1. Several different readers formulated
    the comments; the editor has not attempted to unify and organize
    them strictly. The comments are forwarded to the XHTML Working Group
    on behalf of the XML Schema Working Group, but it should be noted
    that the XML Schema Working Group has not had the leisure to
    consider them in detail.

    The Last Call comment period on this draft ended 4 August 2006, so
    these comments are very late. They are being forwarded nonetheless
    in the hopes that even at this late date they may prove useful to
    those responsible for the XHTML Modularization spec.

    To minimize wasted effort, the copy actually consulted is the
    [32]editor's copy of 19 February 2007.

      [31] http://www.w3.org/TR/2006/WD-xhtml-modularization-20060705
      [32] http://www.w3.org/MarkUp/Group/2007/WD-xhtml-modularization-20070219/introduction.html

1. Background

    Owing apparently to human error, the XML Schema Working Group failed
    to attend to the publication of the Last Call draft of [33]XHTML
    Modularization 1.1, and consequently failed to review the spec
    during the scheduled last-call comment period.

    We apologize for this oversight; our chair has administered severe
    counseling to our staff contact, and our staff contact has promised
    he will endeavor not to make similar mistakes in future.

    Since HTML and XHTML constitute by far the most widely used
    vocabularies published by any W3C Working Group, the Schema Working
    Group has a deep interest in making sure the formulations of XHTML
    using XML Schema are as useful as possible.

    The following comments have been prepared in haste, in an attempt to
    perform as useful a review as possible.

    The Schema Working Group's previous comments (apparently on the
    [34]Last Call draft of 9 December 2002) are at
    <URL:[35]http://www.w3.org/XML/Group/2003/01/xmlschema-notes-on-xhtml-modularization.html>
    and were transmitted to the HTML WG in
    <URL:[36]http://lists.w3.org/Archives/Public/www-html-editor/2003JanMar/0043.html>
    and
    <URL:[37]http://lists.w3.org/Archives/Member/w3c-xml-schema-ig/2003Jan/0099.html>.

    A quick summary of the earlier comments:

     1. Please use the appropriate simple types.
     2. Exploit substitution groups.
     3. Explain what to do about multiple schemas for same namespace.
     4. Don't declare everything blocked and final!
     5. Sec 2.2.6 is opaque.
     6. Point to external documentation.
     7. Provide internal documentation.
     8. Clarify conformance.
     9. More concrete extension scenarios.
    10. Exhibit structure of schema better.

      [33] http://www.w3.org/TR/2006/WD-xhtml-modularization-20060705
      [34] http://www.w3.org/TR/2002/WD-xhtml-m12n-schema-20021209/
      [35] http://www.w3.org/XML/Group/2003/01/xmlschema-notes-on-xhtml-modularization.html
      [36] http://lists.w3.org/Archives/Public/www-html-editor/2003JanMar/0043.html
      [37] http://lists.w3.org/Archives/Member/w3c-xml-schema-ig/2003Jan/0099.html

    It appears that the current document addresses a number of these
    comments very directly; others less so or not at all.

    The XML Schema Working Group appears not to have reviewed or sent
    comments on the later working drafts of [38]3 October 2003 or [39]13
    February 2006.

      [38] http://www.w3.org/TR/2003/WD-xhtml-m12n-schema-20031003/
      [39] http://www.w3.org/TR/2006/PR-xhtml-modularization-20060213/

2. Substantive comments

    The following comments are substantive in the sense that they
    propose changes which would affect the validity of some documents in
    the XHTML family. Whether they are substantive in the sense that
    they would invalidate existing reviews of the Modularization
    document, we leave to others to decide.

2.1. Charset type

    Charset is defined as a vacuous restriction of xsd:string. That may
    be the right thing to do, but it seems likely that a better
    definition can be formulated. First, RFC 2045 defines charset values
    as either tokens or quoted-strings; it defines token as containing
    only ASCII characters and it seems to take over the definition of
    quoted-string from RFC 822, which defines quoted-string as containing
    only ASCII characters. So a better definition of Charset might be

  <xsd:simpleType name="Other-Charset-identifier">
   <xsd:annotation>
    <xsd:documentation>
     <div xmlns="http://www.w3.org/1999/xhtml">
      <p>Other charset values.  RFC 2045 restricts charset
       values to ASCII characters, i.e. those in the
       Unicode BasicLatin block.</p>
     </div>
    </xsd:documentation>
   </xsd:annotation>
   <xsd:restriction base="xsd:string">
    <xsd:pattern value="\p{IsBasicLatin}+">
    </xsd:pattern>
   </xsd:restriction>
  </xsd:simpleType>

    The IANA registry seems to say that in fact charset identifiers are
    limited to 40 characters, but it's not clear whether that rule is
    intended by the XHTML spec to be binding on Charset values in HTML
    documents.

    Another point is that it might be more helpful for readers (and
    possibly implementors) to define the type in such a way as to
    identify at least some of the well-known identifiers which user
    agents should recognize — e.g. those mentioned in RFC 2046 — as well
    as others. One way to do this would be to define a type listing the
    charset values identified in RFC 2046, and then define a union of
    that type with xsd:string. The well-known charset values can be
    enumerated:

  <xsd:simpleType name="RFC2046-Predefined-charsets">
   <xsd:annotation>
    <xsd:documentation>
     <div xmlns="http://www.w3.org/1999/xhtml">
      <p>Charset values predefined by RFC 2046.  Other
       values are also accepted as charset values.</p>
     </div>
    </xsd:documentation>
   </xsd:annotation>
   <xsd:restriction base="xsd:string">
    <xsd:enumeration value="US-ASCII">
     <xsd:annotation>
      <xsd:documentation>As defined in ANSI X3.4-1986.</xsd:documentation>
     </xsd:annotation>
    </xsd:enumeration>
    <xsd:enumeration value="ISO-8859-1"/>
    <xsd:enumeration value="ISO-8859-2"/>
    <xsd:enumeration value="ISO-8859-3"/>
    <xsd:enumeration value="ISO-8859-4"/>
    <xsd:enumeration value="ISO-8859-5"/>
    <xsd:enumeration value="ISO-8859-6"/>
    <xsd:enumeration value="ISO-8859-7"/>
    <xsd:enumeration value="ISO-8859-8"/>
    <xsd:enumeration value="ISO-8859-9"/>
    <xsd:enumeration value="ISO-8859-10"/>
   </xsd:restriction>
  </xsd:simpleType>

    The problem with this is that the RFCs define charset values as
    case-insensitive. So probably a better way to define the well known
    charset values would be with patterns:

  <xsd:simpleType name="RFC2046-Predefined-charsets">
   <xsd:annotation>
    <xsd:documentation>
     <div xmlns="http://www.w3.org/1999/xhtml">
      <p>Charset values predefined by RFC 2046.  Other
       values are also accepted.</p>
     </div>
    </xsd:documentation>
   </xsd:annotation>
   <xsd:restriction base="xsd:string">
    <xsd:whiteSpace value="collapse"/>
    <xsd:pattern value="[Uu][Ss]-[Aa][Ss][Cc][Ii][Ii]">
     <xsd:annotation>
      <xsd:documentation>As defined in ANSI X3.4-1986.</xsd:documentation>
     </xsd:annotation>
    </xsd:pattern>
    <xsd:pattern value="[Ii][Ss][Oo]-8859-(10|[1-9])">
     <xsd:annotation>
      <xsd:documentation>ISO-8859 parts 1-10.</xsd:documentation>
     </xsd:annotation>
    </xsd:pattern>
   </xsd:restriction>
  </xsd:simpleType>

    The actual definition of Charset could usefully be a union of these
    two:

  <xsd:simpleType name="Charset">
   <xsd:annotation>
    <xsd:documentation>
     <div xmlns="http://www.w3.org/1999/xhtml">
      <p>Charset values.  Accept values predefined by RFC 2046,
       and also other values.</p>
     </div>
    </xsd:documentation>
   </xsd:annotation>
   <xsd:union memberTypes="
    xh11d:RFC2046-Predefined-charsets
    xh11d:Other-Charset-identifier
    ">
   </xsd:union>
  </xsd:simpleType>

    A more ambitious definition might mention all of the values in the
    IANA type registry, but the result, when examined, is rather long
    and not really very informative — rather like the registry itself
    — and it is not included here.

2.2. Color type

    Two things seem puzzling in the current definition of Color: (1) it
    allows any NMTOKEN, rather than just the sixteen well known color
    names. And (2) while six-digit hexadecimal values are allowed,
    three-digit values are not allowed. (The description of Color in
    HTML 4.01 (<URL:[40]http://www.w3.org/TR/html401/types.html#h-6.5>)
    doesn't actually specify how many digits are to be used for hex
    color values.)

    If these properties are unintentional, a type that identifies the
    well-known names and allows three-digit hex values may be better:

  <!-- sixteen color names or RGB color expression-->
  <xsd:simpleType name="Color">
   <xsd:union>
    <xsd:simpleType>
     <!--* Known color names are case-insensitive *-->
     <xsd:restriction base="xsd:NMTOKEN">
      <xsd:pattern value="[Bb][Ll][Aa][Cc][Kk]"/>
      <xsd:pattern value="[Gg][Rr][Ee][Ee][Nn]"/>
      <xsd:pattern value="[Ss][Ii][Ll][Vv][Ee][Rr]"/>
      <xsd:pattern value="[Ll][Ii][Mm][Ee]"/>
      <xsd:pattern value="[Gg][Rr][Aa][Yy]"/>
      <xsd:pattern value="[Oo][Ll][Ii][Vv][Ee]"/>
      <xsd:pattern value="[Ww][Hh][Ii][Tt][Ee]"/>
      <xsd:pattern value="[Yy][Ee][Ll][Ll][Oo][Ww]"/>
      <xsd:pattern value="[Mm][Aa][Rr][Oo][Oo][Nn]"/>
      <xsd:pattern value="[Nn][Aa][Vv][Yy]"/>
      <xsd:pattern value="[Rr][Ee][Dd]"/>
      <xsd:pattern value="[Bb][Ll][Uu][Ee]"/>
      <xsd:pattern value="[Pp][Uu][Rr][Pp][Ll][Ee]"/>
      <xsd:pattern value="[Tt][Ee][Aa][Ll]"/>
      <xsd:pattern value="[Ff][Uu][Cc][Hh][Ss][Ii][Aa]"/>
      <xsd:pattern value="[Aa][Qq][Uu][Aa]"/>
     </xsd:restriction>
    </xsd:simpleType>
    <xsd:simpleType>
     <!--* Other numbers are expressed using a hash mark plus a
         * three- or six-digit hexadecimal number *-->
     <xsd:restriction base="xsd:token">
      <xsd:pattern value="#[0-9a-fA-F]{3}([0-9a-fA-F]{3})?"/>
     </xsd:restriction>
    </xsd:simpleType>
   </xsd:union>
  </xsd:simpleType>

      [40] http://www.w3.org/TR/html401/types.html#h-6.5

    If it's desired to allow other NMTOKEN values to count as valid, as
    well as the sixteen named by HTML 4.01 (e.g. for the system colors
    allowed by CSS2
    <URL:[41]http://www.w3.org/TR/REC-CSS2/syndata.html#value-def-color>),
    then inserting

    <xsd:simpleType>
     <xsd:restriction base="xsd:NMTOKEN"/>
    </xsd:simpleType>

      [41] http://www.w3.org/TR/REC-CSS2/syndata.html#value-def-color

    as a final union member would do that. (Since the system colors of
    CSS2 appear to be a finite enumerated list, they could be defined in
    the same way as the sixteen names in HTML 4.01, although for clarity
    they should probably go into a different member type. That's left as
    an exercise for the reader.)
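
    As a sketch of that exercise, a separate member type for the CSS2
    system colors might begin as follows. The type name SystemColor is
    our own invention, and only three of the CSS2 names are shown; the
    remaining names would follow the same pattern. We assume here that
    the system color names are to be matched case-insensitively, like
    the sixteen HTML 4.01 names.

  <xsd:simpleType name="SystemColor">
   <xsd:restriction base="xsd:NMTOKEN">
    <!--* Illustrative subset only; not the full CSS2 list *-->
    <xsd:pattern value="[Bb][Uu][Tt][Tt][Oo][Nn][Ff][Aa][Cc][Ee]"/>
    <xsd:pattern value="[Ww][Ii][Nn][Dd][Oo][Ww]"/>
    <xsd:pattern value="[Ww][Ii][Nn][Dd][Oo][Ww][Tt][Ee][Xx][Tt]"/>
   </xsd:restriction>
  </xsd:simpleType>

    Such a type could then be listed as one more member of the Color
    union, ahead of any catch-all NMTOKEN member.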

2.3. ContentType

    Like Charset, this could be defined as a union whose first member(s)
    recognize well-known values defined by the RFCs or in the IANA
    registry and whose final type (here xsd:string) takes care of
    extensibility. It's not clear to me whether the values are in fact
    limited by the RFC to ASCII characters; if so, xsd:string is a bit
    too broad.

2.4. Coords type

    Since the possible values of Coords are so clearly specified
    in the spec, it seems a shame not to define the type a little more
    tightly. The absence of macros in XML Schema regular expressions
    makes life a little harder, but one reason XML Schema doesn't need
    macros in regexes is that we can use general entities. If we write
    the following entity declarations into the internal subset of the
    schema document, we have general entities which correspond to the
    important bits of coordinate strings, as defined in HTML
    (<URL:[42]http://www.w3.org/TR/html401/struct/objects.html#adef-coords>):

   <!ENTITY Pixel "\d+">
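   <!-- &#37; yields a literal percent sign; a bare percent sign is
        not allowed inside an entity value -->
   <!ENTITY Percent "(\d+&#37;|\d*\.\d+&#37;)">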
   <!ENTITY Length "(&Pixel;|&Percent;)">
   <!ENTITY Comma  "\s*,\s*">
   <!ENTITY Pair   "&Length;&Comma;&Length;">

      [42] http://www.w3.org/TR/html401/struct/objects.html#adef-coords

    That allows the declarations to be fairly clear about their
    structure:

  <xsd:simpleType name="Coords.rect">
   <xsd:restriction base="xsd:token">
    <xsd:pattern value="(&Length;&Comma;){3}(&Length;)"/>
   </xsd:restriction>
  </xsd:simpleType>
  <xsd:simpleType name="Coords.circle">
   <xsd:restriction base="xsd:token">
    <xsd:pattern value="(&Length;&Comma;){2}(&Length;)"/>
   </xsd:restriction>
  </xsd:simpleType>
  <xsd:simpleType name="Coords.poly">
   <xsd:restriction base="xsd:token">
    <xsd:pattern value="(&Pair;&Comma;){2,}(&Pair;)"/>
   </xsd:restriction>
  </xsd:simpleType>

    If they prove to cause trouble for any schema processors, of course,
    the entity references can be expanded.

    And the Coords type can be clear that what is expected is either the
    coordinates for a rectangle, or those for a circle, or those for a
    polygon. (Type-aware systems can use the information about which
    member type in the union actually accepted the value to perform a
    sanity check: if the coords attribute has type Coords.rect, then the
    value of the shape attribute had better be 'rect', and vice versa.)

  <xsd:simpleType name="Coords">
   <xsd:union memberTypes="
     xh11d:Coords.rect
     xh11d:Coords.circle
     xh11d:Coords.poly">
   </xsd:union>
  </xsd:simpleType>

2.5. FPI type

    ISO 8879 appears to define the formal public identifier using a
    regular language, which means it's not necessary to allow any
    xsd:normalizedString value. (The formalization below assumes that
    only unregistered owner identifiers are to be used, since section
    3.6 of this spec says the value must begin with '-'.) Building it up
    gradually using entities, one can write:

   <!ENTITY minimum-data "[ a-zA-Z0-9'()+,\-./:=?]*">
   <!ENTITY owner-id   "&minimum-data;">
   <!ENTITY textclass1 "(DTD|ELEMENTS|ENTITIES|NOTATION|TEXT)">
   <!ENTITY textclass2 "(CAPACITY|CHARSET|DOCUMENT|LPD|NONSGML|SHORTREF|SUBDOC|SYNTAX)">
   <!ENTITY textclass  "(&textclass1;|&textclass2;)">

    It's not clear that any of the names in textclass2 make any sense
    whatever for modules intended for use in the XHTML family, so one
    might choose to omit them.

   <!ENTITY langname   "(\i\c*)">
   <!ENTITY designator "&minimum-data;">
   <!ENTITY lang-or-des "(&langname;|&designator;)">
   <!ENTITY display    "&minimum-data;">
   <!ENTITY textdesc   "&minimum-data;">

   <!ENTITY textid "&textclass; (-//)?&textdesc;//&lang-or-des;(//&display;)?">

   <!ENTITY fpi "-//&owner-id;//&textid;">

    The pattern is then quite simple:

  <xsd:simpleType name="FPI">
   <xsd:restriction base="xsd:normalizedString">
    <xsd:pattern value="&fpi;"/>
   </xsd:restriction>
  </xsd:simpleType>

2.6. FrameTarget type

    The HTML spec
    (<URL:[43]http://www.w3.org/TR/html401/types.html#h-6.16>) seems to
    want a slightly tighter definition of frame target names. Perhaps
    something like the following should be used.

  <xsd:simpleType name="FrameTarget">
   <xsd:union>
    <xsd:simpleType>
     <xsd:restriction base="xsd:NMTOKEN">
      <xsd:enumeration value="_blank"/>
      <xsd:enumeration value="_self"/>
      <xsd:enumeration value="_parent"/>
      <xsd:enumeration value="_top"/>
     </xsd:restriction>
    </xsd:simpleType>
    <xsd:simpleType>
     <xsd:restriction base="xsd:string">
      <xsd:pattern value="[a-zA-Z].*"/>
     </xsd:restriction>
    </xsd:simpleType>
   </xsd:union>
  </xsd:simpleType>

      [43] http://www.w3.org/TR/html401/types.html#h-6.16

2.7. LinkTypes type

    LinkTypes is a good example of a type with what is sometimes called
    a ‘semi-open’ list of values. Some set of well-known values is
    defined, which software is encouraged to recognize and which authors
    are encouraged to use when appropriate, but for strict validity, a
    much larger set of values is allowed.

    In such cases, it's good practice to document the recognized types
    in the type definition. Since the well known values here are case
    insensitive, that's best done with a list of patterns rather than
    with an enumeration:

  <xsd:simpleType name="KnownLinkTypes">
   <xsd:restriction base="xsd:NMTOKEN">
    <xsd:pattern value="[Aa][Ll][Tt][Ee][Rr][Nn][Aa][Tt][Ee]"/>
    <xsd:pattern value="[Ss][Tt][Yy][Ll][Ee][Ss][Hh][Ee][Ee][Tt]"/>
    <xsd:pattern value="[Ss][Tt][Aa][Rr][Tt]"/>
    <xsd:pattern value="[Nn][Ee][Xx][Tt]"/>
    <xsd:pattern value="[Pp][Rr][Ee][Vv]"/>
    <xsd:pattern value="[Cc][Oo][Nn][Tt][Ee][Nn][Tt][Ss]"/>
    <xsd:pattern value="[Ii][Nn][Dd][Ee][Xx]"/>
    <xsd:pattern value="[Gg][Ll][Oo][Ss][Ss][Aa][Rr][Yy]"/>
    <xsd:pattern value="[Cc][Oo][Pp][Yy][Rr][Ii][Gg][Hh][Tt]"/>
    <xsd:pattern value="[Cc][Hh][Aa][Pp][Tt][Ee][Rr]"/>
    <xsd:pattern value="[Ss][Ee][Cc][Tt][Ii][Oo][Nn]"/>
    <xsd:pattern value="[Ss][Uu][Bb][Ss][Ee][Cc][Tt][Ii][Oo][Nn]"/>
    <xsd:pattern value="[Aa][Pp][Pp][Ee][Nn][Dd][Ii][Xx]"/>
    <xsd:pattern value="[Hh][Ee][Ll][Pp]"/>
    <xsd:pattern value="[Bb][Oo][Oo][Kk][Mm][Aa][Rr][Kk]"/>
   </xsd:restriction>
  </xsd:simpleType>

  <xsd:simpleType name="LinkTypes">
   <xsd:union memberTypes="xh11d:KnownLinkTypes xsd:NMTOKEN"/>
  </xsd:simpleType>

2.8. Tightening other types

    If we continue in the same way, we risk belaboring our point past
    reason. So instead of commenting in detail on individual types which
    could, it seems to us, usefully be made more restrictive, or more
    informative, or both, by means of enumerations or patterns to
    recognize well known values or unions to combine subtypes (including
    more and less restrictive definitions of a datatype), we will merely
    say that we believe other types should also be given definitions
    closer to the requirements of the prose. (MultiLength, for example,
    is not really that hard to capture with a pattern.)
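
    As a rough illustration, and assuming the three forms described
    for MultiLength in HTML 4.01 (a pixel count, a percentage, or a
    relative length such as "2*" or a plain "*"), a pattern-based
    definition might look like the following sketch. It is not taken
    from the Modularization schema documents.

  <xsd:simpleType name="MultiLength">
   <xsd:restriction base="xsd:token">
    <!--* pixels (100), percentage (50%), or relative (2* or *) *-->
    <xsd:pattern value="\d+|\d+%|\d*\*"/>
   </xsd:restriction>
  </xsd:simpleType>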

2.9. Named model groups vs. substitution groups

    We reiterate our advice of four years ago: the definition of the
    XHTML vocabulary would be easier to follow, and it would be easier
    to extend it, if the schema documents used substitution groups
    wherever feasible.

    If you have had specific problems applying substitution groups to
    XHTML, we would very much like to know what they were; we can
    speculate, but would prefer to hear from you.

    Using named model groups for extensibility has a number of
    unfortunate side effects. For example, the schema includes this
    definition:

   <xs:group
          name="xhtml.title.content">
          <xs:sequence/>
      </xs:group>

    What's the point of that, exactly? Presumably the idea is to play a
    similar trick to what you did when this was a DTD and splice your
    own stuff in there from your own namespace. But how does using a
    group get you there? It's not impossible, but it is harder than
    necessary and you could just as easily redefine the element in
    question directly. So defining all these content groups just gums
    up the schema and makes it harder to read. (Those accustomed to
    DTD-based extension of vocabularies may have little trouble
    following the logic here, but that group may no longer be as large
    as it once was.)

    If a user wants to use XHTML and just add one little inline element
    or allow some new content in, say, the title element, the user has
    to jump through a few unnecessary hoops.

    This scenario could be better enabled even within the existing
    architecture just by adding an abstract substitution group head as a
    choice to all the named model groups.

    So even if you don't restructure the schema documents to use
    substitution groups wherever possible, you could simplify
    extensibility for users of the spec a great deal by just adding an
    abstract element to each group, or each content model where
    extensibility is an obvious requirement, to provide hooks for later
    schema authors.
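
    A minimal sketch of such a hook follows. The element names
    title.extra and subtitle and the prefix xh (taken to be bound to
    the XHTML namespace) are hypothetical, and the sketch assumes that
    the head element is not blocked against substitution.

   <!-- in the XHTML schema document -->
   <xs:element name="title.extra" abstract="true"/>

   <xs:group name="xhtml.title.content">
     <xs:sequence>
       <xs:element ref="xh:title.extra"
                   minOccurs="0" maxOccurs="unbounded"/>
     </xs:sequence>
   </xs:group>

   <!-- in a later schema author's own schema document -->
   <xs:element name="subtitle" type="xs:string"
               substitutionGroup="xh:title.extra"/>

    With a hook of this kind, the later schema author needs only the
    one element declaration and no redefinition of the group.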

2.10. Adding attributes

    It's not clear that the way modules add attributes works. For
    example, the client side image map module adds attributes to the img
    element. All well and good, but looking at the schema I see an
    attribute group defined:

   <!-- modify img attribute definition list -->
      <xs:attributeGroup name="xhtml.img.csim.attlist">
          <xs:attribute name="usemap" type="xs:IDREF"/>
      </xs:attributeGroup>

    I can't see where this actually is used anywhere in the schema. I
    think what the module should be doing is a redefine of the groups.
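
    What I have in mind is roughly the following sketch, in which the
    module redefines the attribute list of img so that it includes the
    new attribute. The group name xhtml.img.attlist and the file name
    xhtml-image-1.xsd are my assumptions about the image module, not
    names I have checked against the schema documents.

   <xs:redefine schemaLocation="xhtml-image-1.xsd">
     <xs:attributeGroup name="xhtml.img.attlist">
       <!-- the self-reference pulls in the original attributes -->
       <xs:attributeGroup ref="xhtml.img.attlist"/>
       <xs:attribute name="usemap" type="xs:IDREF"/>
     </xs:attributeGroup>
   </xs:redefine>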

2.11. A missing scenario

    One important scenario that seems to be missing is just plonking
    bits of the XHTML namespace into specific places in some other
    namespace. Maybe it's too obvious or easy, but it is actually the
    most common scenario: e.g., MyOwnLanguage has its own things, and
    I'll just put some XHTML inline elements here.

    Introducing XHTML elements into the xsd:documentation elements in a
    schema document is another instance of the scenario.
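
    A sketch of what such a schema document might look like follows.
    The target namespace, the element name note, and the choice of
    em, strong, and code are invented for illustration, and the
    schemaLocation assumes that xhtml11.xsd can serve as a driver for
    the XHTML namespace.

   <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
              xmlns:xh="http://www.w3.org/1999/xhtml"
              targetNamespace="http://example.org/my-own-language"
              elementFormDefault="qualified">

     <xs:import namespace="http://www.w3.org/1999/xhtml"
                schemaLocation="http://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd"/>

     <!-- text mixed with a few XHTML inline elements -->
     <xs:element name="note">
       <xs:complexType mixed="true">
         <xs:choice minOccurs="0" maxOccurs="unbounded">
           <xs:element ref="xh:em"/>
           <xs:element ref="xh:strong"/>
           <xs:element ref="xh:code"/>
         </xs:choice>
       </xs:complexType>
     </xs:element>

   </xs:schema>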

3. Editorial comments

    The following comments are editorial; we hope that they can be made
    without invalidating any existing reviews of the specification.

3.1. Make the introduction less DTD-specific

    Section 1 Introduction
    <URL:[44]http://www.w3.org/TR/xhtml-modularization/introduction.html>
    also
    <URL:[45]http://www.w3.org/MarkUp/Group/2007/WD-xhtml-modularization-20070219/introduction.html>

    sec 1.2 para 1: "These abstract modules are implemented in this
    specification using the XML Document Type Definition language, but
    an implementation using XML Schemas is expected." Read "These
    abstract modules are implemented in this specification using both
    the XML Document Type Definition language and XML Schema 1.0."?

    sec 1.3.4 para 2:

      [44] http://www.w3.org/TR/xhtml-modularization/introduction.html
      [45] http://www.w3.org/MarkUp/Group/2007/WD-xhtml-modularization-20070219/introduction.html

      A document is an instance of one particular document type defined
      by the DTD identified in the document's prologue. Validating the
      document is the process of checking that the document complies
      with the rules in the document type definition.

    Here (as elsewhere) there are traces of DTD-only terminology. Some
    SGML experts maintain that the term "document type definition" of
    ISO 8879 and XML is defined broadly enough to include schemas
    defined with XSD or with any other language currently known to
    information technology — on that reading, the only problem with the
    paragraph just quoted is the assumption that the document and its
    DTD are associated in the document's prologue.

    Normal usage, however, uses the term "document type definition" with
    narrower scope nowadays, to mean only those schemas written using
    the bracket-bang keyword syntax of ISO 8879 and the XML spec. On
    that reading, there are several things in this paragraph that apply
    only to conventional XML DTDs, not to schemas in general:

      * In fact, any document is an instance of an infinite number of
        document types and schemas (or document type definitions), just
        as any object is contained by an infinite number of sets. This
        fact does not conflict with the equally important fact that an
        author may wish to advertise conformance to a particular schema
        or affiliation with a particular document type, either for the
        sake of tool support or for other reasons.
      * Documents may be associated with a schema by their prolog, or by
        xsi:schemaLocation hints in the document instance, or by
        out-of-band associations between document and schema (e.g. by
        parameters passed to the validator at invocation time).
      * Validation is the process of checking whether, not the process
        of ensuring that, a document complies with the rules in the
        document type definition.
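
    For concreteness, an xsi:schemaLocation hint in a document
    instance might look like the following sketch. The schema URI is
    used only as an example; we assume, without having verified it,
    that it can serve as a driver for the XHTML namespace.

   <html xmlns="http://www.w3.org/1999/xhtml"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/1999/xhtml
                             http://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd">
     <head><title>Example</title></head>
     <body><p>Hello.</p></body>
   </html>

    Nothing in the prolog of such a document identifies a schema; the
    association is made entirely within the document instance.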

    To make this paragraph cover the current situation (where you're
    providing normative XSD schema documents as well as normative DTDs),
    you might consider saying something like the following. If you're
    willing to adopt the term "schema" as the general term for a formal
    machine-readable expression of the rules for a document type, then:

      A document may be associated with a particular document type
      defined by a schema. The document's prolog may identify a DTD, or
      xsi:schemaLocation attributes may be used to associate the
      document with a schema written in XML Schema 1.0, or the document
      may be associated with a schema by other means (e.g.
      validation-time identification of the schema by means of a
      parameter passed to a validator). Validating the document is the
      process of testing whether the document complies with the rules in
      the schema.

    Or if you'd prefer to stay with "document type definition", you
    could write:

      A document may be associated with a particular document type. The
      document's prolog may identify a DTD, or xsi:schemaLocation
      attributes may be used to associate the document with a document
      type definition written in XML Schema 1.0, or the document may be
      associated with a document type definition by other means (e.g. a
      parameter passed to a validator). Validating the document is the
      process of testing whether the document complies with the rules in
      the document type definition.

    If you stick with "document type definition", you might want to add
    something to the definition of "document type definition" in the
    glossary, e.g. by changing the sentence:

      The same markup model may be expressed by a variety of DTDs.

    to something like

      The same markup model may be expressed by a variety of document
      type definitions, written in a variety of languages, such as the
      DTD notation of XML or XML Schema 1.0.

    just to make explicit somewhere that you're using "document type
    definition" to cover rules written in a variety of languages. You
    could mention Relax NG and/or Schematron, too, if you wish.

3.2. The term PCDATA

    Section 4.2
    <URL:[46]http://www.w3.org/MarkUp/Group/2007/WD-xhtml-modularization-20070219/abstraction.html>
    4.2 para 1 reads in part

      [46] http://www.w3.org/MarkUp/Group/2007/WD-xhtml-modularization-20070219/abstraction.html

      ... In these cases, the symbol used for text is PCDATA (processed
      characted data). This is a term, defined in the XML 1.0
      Recommendation, that refers to processed character data. ...

    Strictly speaking, XML 1.0 doesn't define the term; it only says

      The keyword #PCDATA derives historically from the term "parsed
      character data."

    (Note also the typo 'characted' for 'character'.)

    We'd suggest rewording to say something like

      ... In these cases, the symbol used for text is PCDATA; this is
      short for "parsed character data", denoting sequences of
      characters which are to be parsed for markup by an XML processor.
      ...

3.3. Section 4.3 Attribute Types

    Congratulations to the editors; this section is much easier to read
    and follow than is sometimes the case when specs define (or fail to
    define) fundamental types used throughout them.

    Some comments on the definitions of some of the datatypes, as found
    in
    <URL:[47]http://www.w3.org/TR/xhtml-modularization/SCHEMA/xhtml-datatypes-1.xsd>
    and other schema documents, may be found in section 2 above.

      [47] http://www.w3.org/TR/xhtml-modularization/SCHEMA/xhtml-datatypes-1.xsd

3.4. Length type: well done

    The definition for Length seems well done. Good work!

3.5. Shape type

    Shouldn't the overview in section 4.3 say that Shape has just the
    four values rect, circle, poly, and default?

3.6. White space in the document source

    Minor but extremely irritating:
    <URL:[48]http://www.w3.org/MarkUp/Group/2007/WD-xhtml-modularization-20070219/schema_module_defs.html#a_smodule_Text>
    <URL:[49]http://www.w3.org/MarkUp/Group/2007/WD-xhtml-modularization-20070219/schema_module_defs.html#a_smodule_Presentation>
    (and presumably others) have the tabbing alignment in the schema
    messed up, making it harder to read.

      [48] http://www.w3.org/MarkUp/Group/2007/WD-xhtml-modularization-20070219/schema_module_defs.html#a_smodule_Text
      [49] http://www.w3.org/MarkUp/Group/2007/WD-xhtml-modularization-20070219/schema_module_defs.html#a_smodule_Presentation

4. Comments half substantive and half editorial

    The following comments may be regarded as purely editorial, or they
    may be regarded as substantive; we leave that judgment to you.

4.1. Testing the schema documents

    We endeavored to test the schema documents for syntax errors or
    other problems, but encountered some difficulty knowing where to
    start. Which file(s) should be used as the top-level driver file(s)?
    One test reported:

    I'm using files extracted from
    <URL:[50]http://www.w3.org/TR/xhtml-modularization/xhtml-modularization.zip>.

      [50] http://www.w3.org/TR/xhtml-modularization/xhtml- 
modularization.zip

    xhtml-framework-1.xsd seems to be the root (the first one mentioned
    in Appendix C). But it won't compile (missing many att-groups like
    "xhtml.Core.extra.attrib" and "xhtml.I18n.extra.attrib"). I can't
    tell whether this is an error or users of these schemas must provide
    definitions of those att-groups. (Looks like the latter, because one
    of the examples myml-model-1.xsd defines those missing groups.)

    I was hoping testing.xml could be a little more helpful, but
    unfortunately it refers to
    <URL:[51]file:/C:/cygwin/home/ahby/htmlwg/xhtml-modularization/SCHEMA/xhtml11.xsd>.
    I really hope I can't access someone else's "file:/C:/".
    xhtml11.xsd doesn't exist anywhere.

      [51] file://localhost/C:/cygwin/home/ahby/htmlwg/xhtml-modularization/SCHEMA/xhtml11.xsd

    So I gave up on that. Then I looked in the examples directory.
    "simpleml-1_0.xsd" doesn't refer to anything like "../". It
    redefines "xhtml.Misc.class" in
    http://www.w3.org/MarkUp/SCHEMA/xhtml-basic10.xsd. But Xerces-J
    fails to locate that group in the schema being redefined. (I found a
    Misc.class, but nothing starts with "xhtml.".) I then got many more
    errors about missing components. Similar to the ones I got from
    xhtml-framework-1.xsd, but different. (Note that these errors are
    from schema files in http://www.w3.org/MarkUp/SCHEMA/.)

    My last hope was those .html files in examples. Unfortunately all
    they gave me was more errors, both in the schema and the
    instance.

    In summary, I don't know how these files should be used, so I can't
    claim that they are broken. No useful input from me ...

    [Later information from Shane McCarron is that this spec doesn't
    provide a driver, but that
    <URL:[52]http://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd> might be
    consulted as an example. To be followed up ...]

      [52] http://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd
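
    The kind of failure reported above (attribute groups that are
    referenced but never defined in the documents loaded) can be
    checked for mechanically. The following is only a rough sketch in
    Python, not a substitute for a schema processor: the helper
    missing_attribute_groups and the two tiny schema documents are
    invented for illustration, namespace prefixes on references are
    simply dropped rather than resolved, and xsd:include, xsd:import
    and xsd:redefine are not followed.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"


def missing_attribute_groups(schema_texts):
    """Return names of attribute groups referenced but not defined."""
    defined, referenced = set(), set()
    for text in schema_texts:
        root = ET.fromstring(text)
        for el in root.iter(XSD + "attributeGroup"):
            if "name" in el.attrib:
                defined.add(el.attrib["name"])
            elif "ref" in el.attrib:
                # Simplification: drop any prefix instead of resolving the QName.
                referenced.add(el.attrib["ref"].split(":")[-1])
    return sorted(referenced - defined)


# Hypothetical stand-ins for a framework module and a model module.
framework = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:attributeGroup name="xhtml.Core.attrib">
    <xs:attribute name="id" type="xs:ID"/>
    <xs:attributeGroup ref="xhtml.Core.extra.attrib"/>
  </xs:attributeGroup>
</xs:schema>"""

model = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:attributeGroup name="xhtml.Core.extra.attrib"/>
</xs:schema>"""

print(missing_attribute_groups([framework]))         # framework module alone
print(missing_attribute_groups([framework, model]))  # with the model module added
```

    Run on the framework stand-in alone, the sketch lists
    xhtml.Core.extra.attrib as missing; once the model stand-in is
    supplied as well, nothing is missing. That mirrors the tester's
    guess that users of the modules are expected to supply those
    definitions themselves.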

4.2. Where is the html element?

    (Possibly related to the preceding.)
    Where is the html element defined?
    After some searching, starting not from this document but from
    <URL:[53]http://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd>, we found a
    definition in
    <URL:[54]http://www.w3.org/MarkUp/SCHEMA/xhtml11-model-1.xsd>.
    This may be solely an editorial issue: the abstract says

      [53] http://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd
      [54] http://www.w3.org/MarkUp/SCHEMA/xhtml11-model-1.xsd

      This modularization provides a means for subsetting and extending
      XHTML, a feature needed for extending XHTML's reach onto emerging
      platforms. This specification is intended for use by language
      designers as they construct new XHTML Family Markup Languages.

    and this has led at least some readers to infer that the modules
    defined here would include everything needed for a definition of
    XHTML 1.1, including the top-level driver files.
    If the problem is editorial, the solution is also editorial: the
    spec needs to make clear(er) that no top-level driver for XHTML is
    provided. (And, for the instruction of those seeking to understand
    how to use these modules, a pointer to the XHTML 1.1 driver modules
    would be very useful. If such a pointer is already present, then let
    this note serve as a record that at least some readers didn't see
    the pointer when they needed to.)

    But the issue appears to at least some readers as at least partly
    substantive: that is, it seems to us that a specification describing
    a modular definition of the XHTML 1.1 vocabulary ought, in the
    nature of things, to include a top-level driver module which calls
    in all the others.

4.3. Case insensitivity and XML Schema patterns or enumerations

    Several of the alternative type definitions offered elsewhere in
    these comments propose to use patterns (rather than enumerations,
    as one might expect) to handle the well known values for types which
    have well known values. In the numerous cases in which the values
    are defined as case insensitive, the pattern for a
    (case-insensitive) value like “black” is written “<xsd:pattern
    value="[Bb][Ll][Aa][Cc][Kk]"/>”.
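
    To illustrate how mechanical this expansion is, here is a minimal
    sketch in Python that produces such a pattern from a keyword. The
    helper name case_insensitive_pattern is made up for this note, and
    the sketch assumes the keyword contains only ASCII letters and
    digits; it does not attempt the escaping that other characters
    would need.

```python
import re
import string


def case_insensitive_pattern(value):
    """Expand an ASCII keyword into a per-letter character-class pattern."""
    parts = []
    for ch in value:
        if ch in string.ascii_letters:
            # Each letter becomes a two-character class, e.g. "b" -> "[Bb]".
            parts.append("[" + ch.upper() + ch.lower() + "]")
        elif ch in string.digits:
            parts.append(ch)
        else:
            # Anything else would need escaping rules this sketch omits.
            raise ValueError("unsupported character: %r" % ch)
    return "".join(parts)


pattern = case_insensitive_pattern("black")
print('<xsd:pattern value="%s"/>' % pattern)

# XSD patterns must match the whole value, so re.fullmatch is the closest
# Python analogue for trying the pattern out.
for candidate in ["black", "BLACK", "Black", "blacks"]:
    print(candidate, bool(re.fullmatch(pattern, candidate)))
```

    The first three candidates match and "blacks" does not, since the
    pattern has to cover the entire value.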

    The regularity with which this technique must be used suggests that
    perhaps XML Schema should add a caseInsensitive flag to patterns.
    This would allow writing the pattern “<xsd:pattern value="black"
    caseInsensitive="true"/>” instead.

    Given that many regex libraries already have such flags, such an
    addition wouldn't seem to be difficult for implementors.
    Should the XML Schema Working Group consider such a change?

    And if so, what is to be done about Unicode characters for which the
    upper/lowercase mapping is not 1:1? And what should be done about
    title case?
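
    For concreteness, a few characters for which the mapping is not
    1:1 can be inspected with Python's built-in string methods, which
    apply Unicode's full case mappings in Python 3. This is an
    illustration of the question only, not a proposal for how a schema
    processor should behave; the helper case_forms and the choice of
    sample characters are ours.

```python
def case_forms(text):
    """Return the lower, upper and title case forms Python computes."""
    return {"lower": text.lower(), "upper": text.upper(), "title": text.title()}


samples = {
    "\u00df": "LATIN SMALL LETTER SHARP S",
    "\u0130": "LATIN CAPITAL LETTER I WITH DOT ABOVE",
    "\u01c6": "LATIN SMALL LETTER DZ WITH CARON",
    "\u03c2": "GREEK SMALL LETTER FINAL SIGMA",
}

for ch, name in samples.items():
    forms = case_forms(ch)
    # Show code points so that multi-character results are visible.
    print(name, {k: [hex(ord(c)) for c in v] for k, v in forms.items()})
```

    Sharp s uppercases to the two letters "SS"; capital I with dot
    above lowercases to "i" followed by a combining dot; the dz
    digraph has a title case form (U+01C5) distinct from its upper
    case form (U+01C4); and final sigma uppercases to a capital sigma
    that, taken alone, lowercases to the non-final form. A
    caseInsensitive flag would have to say which of these equivalences
    it honors.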

Received on Tuesday, 27 February 2007 22:45:37 UTC