Re: comments on XHTML Modularization 1.1 from XML Schema WG from Steven Pemberton on 2007-08-22 (www-html-editor@w3.org from July to September 2007)

From: Steven Pemberton <steven.pemberton@cwi.nl>
Date: Wed, 22 Aug 2007 14:20:39 +0200
To: "C. M. Sperberg-McQueen" <cmsmcq@acm.org>, www-html-editor@w3.org
Cc: "Schema IG" <w3c-xml-schema-ig@w3.org>
Message-ID: <op.txgm8pidsmjzpq@acer3010.lan>

Dear Michael, and other colleagues,

Thank you for your belated last call comments on XHTML Modularization 1.1.
  http://lists.w3.org/Archives/Public/www-html-editor/2007JanMar/0035

To return the favour, here is our belated reply :-)/2 (largely caused by  
our rechartering, which happened after your comments arrived).

2.1. Charset type

     Charset is defined as a vacuous restriction of xsd:string. That may
     be the right thing to do, but it seems likely that a better
     definition can be formulated.
[...]
     A more ambitious definition might mention all of the values in the
     IANA type registry, but the result, when examined, is rather long
     and not really very informative — rather like the registry itself
     — and it is not included here.

While we agree on the principle of validating as much as possible, we are  
wary of duplicating someone else's list in a specification: we run the  
risk of making the schema brittle, and needing to be regularly updated.

2.2. Color type

     Two things seem puzzling in the current definition of Color: (1) it
     allows any NMTOKEN, rather than just the sixteen well known color
     names. And (2) while six-digit hexadecimal values are allowed,
     three-digit values are not allowed. (The description of Color in
     HTML 4.01 (<URL:http://www.w3.org/TR/html401/types.html#h-6.5>)
     doesn't actually specify how many digits are to be used for hex
     color values.)

Three digit hex colour values were introduced in CSS, and are not actually  
a part of HTML; in fact we agree that the HTML definition is a little  
unclear, and only seems to suggest what the correct values are through  
examples. The problem is, with legacy content now on the web, it is  
difficult to say whether colour "#FAB" should be interpreted as "#FFAABB"  
as it is in CSS, or "#000FAB" as would be suggested if you interpret the  
value as "a hexadecimal number" which is what the specification says it  
is. Since the 6 digit version is the only likely interoperable one, we  
prefer to keep it at that. As for the sixteen well-known values, while  
these are defined in the HTML4 specification, many other values are now  
extant and interoperable on the web (and remember that Modularization is  
for a whole family of languages, not just HTML4 derivatives).
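
As a rough illustration of the definition we prefer to keep (a name, or
exactly six hex digits), here is a minimal sketch. It is not the normative
schema text, and the helper type name HexColor6 is made up for this example.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical helper type: '#' followed by exactly six hex digits -->
  <xs:simpleType name="HexColor6">
    <xs:restriction base="xs:string">
      <xs:pattern value="#[0-9a-fA-F]{6}"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Color: either a colour name (any NMTOKEN) or a six-digit hex value -->
  <xs:simpleType name="Color">
    <xs:union memberTypes="xs:NMTOKEN HexColor6"/>
  </xs:simpleType>

</xs:schema>
```

Under this sketch "#FFAABB" is valid, while "#FAB" is rejected: it is not an
NMTOKEN (because of the "#") and does not have six digits, so the schema
never has to choose between the two readings.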

2.3. ContentType

     Like Charset, this could be defined as a union whose first member(s)
     recognize well-known values defined by the RFCs or in the IANA
     registry and whose final type (here xsd:string) takes care of
     extensibility. It's not clear to me whether the values are in fact
     limited by the RFC to ASCII characters; if so, xsd:string is a bit
     too broad.

We are considering this change for a future revision.
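
For reference, the union approach described in the comment might look
something like the sketch below. The type name KnownContentType and the
three enumerated values are only illustrative assumptions, not a proposed
list.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical list of well-known values; purely illustrative -->
  <xs:simpleType name="KnownContentType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="text/html"/>
      <xs:enumeration value="text/css"/>
      <xs:enumeration value="application/xhtml+xml"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- First member documents the known values; xs:string keeps it open -->
  <xs:simpleType name="ContentType">
    <xs:union memberTypes="KnownContentType xs:string"/>
  </xs:simpleType>

</xs:schema>
```

Note that the set of valid values is still that of xs:string; the first
member documents the known values rather than restricting anything.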

2.4. Coords type

     Since the possible values of Coords values are so clearly specified
     in the spec, it seems a shame not to define the type a little more
     tightly.

This seems like a reasonable suggestion.
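
A tighter definition could be sketched along these lines, assuming the
HTML 4.01 reading of coords as a comma-separated list of lengths (pixel
counts or percentages). The pattern is our own guess at a starting point,
and it does not check that the number of values matches the shape.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Sketch only: one or more lengths (digits, optional '%'),
       separated by commas, with optional white space around each -->
  <xs:simpleType name="Coords">
    <xs:restriction base="xs:string">
      <xs:pattern value="\s*[0-9]+%?\s*(,\s*[0-9]+%?\s*)*"/>
    </xs:restriction>
  </xs:simpleType>

</xs:schema>
```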

2.5. FPI type
[...]
     The pattern is then quite simple:

   <xsd:simpleType name="FPI">
    <xsd:restriction base="xsd:normalizedString">
     <xsd:pattern value="&fpi;"/>
    </xsd:restriction>
   </xsd:simpleType>

Looks good.

2.6. FrameTarget type

     The HTML spec
     (<URL:http://www.w3.org/TR/html401/types.html#h-6.16>) seems to
     want a slightly tighter definition of frame target names. Perhaps
     something like the following should be used.

Good idea.

2.7. LinkTypes type

     LinkTypes is a good example of a type with what is sometimes called
     a ‘semi-open’ list of values. Some set of well-known values is
     defined, which software is encouraged to recognize and which authors
     are encouraged to use when appropriate, but for strict validity, a
     much larger set of values is allowed.

     In such cases, it's good practice to document the recognized types
     in the type definition. Since the well known values here are case
     insensitive, that's best done with a list of patterns rather than
     with an enumeration:

Frankly, this looks rather like overkill to us. These values are intended
only as an initial set, with many more expected to come into use, so we
don't really see the value-add of including these few in the schema
(especially since the result is not really readable).
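
To show what we mean by readability, here is a small sketch of the
pattern-per-value style for just three link types. The type name
KnownLinkType and the choice of values are assumptions made for this
example.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Case-insensitive matching has to be spelled out letter by letter;
       several patterns in one restriction are alternatives -->
  <xs:simpleType name="KnownLinkType">
    <xs:restriction base="xs:NMTOKEN">
      <xs:pattern value="[Aa][Ll][Tt][Ee][Rr][Nn][Aa][Tt][Ee]"/>
      <xs:pattern value="[Ss][Tt][Yy][Ll][Ee][Ss][Hh][Ee][Ee][Tt]"/>
      <xs:pattern value="[Nn][Ee][Xx][Tt]"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- The open set: a space-separated list of known or any other names -->
  <xs:simpleType name="LinkTypes">
    <xs:list>
      <xs:simpleType>
        <xs:union memberTypes="KnownLinkType xs:NMTOKEN"/>
      </xs:simpleType>
    </xs:list>
  </xs:simpleType>

</xs:schema>
```

Since the union ends in xs:NMTOKEN, the three patterns document the known
values but do not change what validates.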

2.8. Tightening other types

In general we agree that closed sets of values should be more tightly  
defined; we are not so enamoured of defining values of open sets, since  
there is no validation win.

2.9. Named model groups vs. substitution groups

     We reiterate our advice of four years ago: the definition of the
     XHTML vocabulary would be easier to follow, and it would be easier
     to extend it, if the schema documents used substitution groups
     wherever feasible.

     If you have had specific problems applying substitution groups to
     XHTML, we would very much like to know what they were; we can
     speculate, but would prefer to hear from you.

The people who produced the schema felt the approach used here to be
the most consistent with Modularization in general, and the one most
likely to work. Nevertheless, we take your advice seriously, and would like
to adopt it. However, in order to allow Modularization to proceed without
too much more delay, we will not make this (rather drastic) change in
this version, but will save it for the planned version 2.
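
For readers unfamiliar with the technique, a substitution-group version of
a content model might be sketched as below. The abstract head element
"inline" and the simplified types are hypothetical; this is not a design
for version 2.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical abstract head: never appears in instances itself -->
  <xs:element name="inline" type="xs:string" abstract="true"/>

  <!-- Members of the group; a new module could add more the same way -->
  <xs:element name="em" type="xs:string" substitutionGroup="inline"/>
  <xs:element name="strong" type="xs:string" substitutionGroup="inline"/>

  <!-- The content model names only the head, so it need not be redefined
       when further members are added -->
  <xs:element name="p">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element ref="inline" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
```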

2.10. Adding attributes

     It's not clear that the way modules add attributes works. For
     example, the client side image map module adds attributes to the img
     element. All well and good, but looking at the schema I see an
     attribute group defined:

    <!-- modify img attribute definition list -->
       <xs:attributeGroup name="xhtml.img.csim.attlist">
           <xs:attribute name="usemap" type="xs:IDREF"/>
       </xs:attributeGroup>

     I can't see where this actually is used anywhere in the schema. I
     think what the module should be doing is a redefine of the groups.

The extension mechanisms get used in the 'drivers' which define a language  
on the basis of the modules. There is no driver supplied with  
Modularization; you need to look at a particular language's use of
Modularization to see these in use.
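
Very roughly, and only as a sketch, a driver can pick up a module's
attribute group by reference when it declares the element. The img
declaration below is a simplified, hypothetical stand-in and is not taken
from any actual driver.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- As defined by the client-side image map module (quoted above);
       repeated here only to keep the sketch self-contained -->
  <xs:attributeGroup name="xhtml.img.csim.attlist">
    <xs:attribute name="usemap" type="xs:IDREF"/>
  </xs:attributeGroup>

  <!-- Hypothetical driver-level declaration that uses the group -->
  <xs:element name="img">
    <xs:complexType>
      <xs:attribute name="src" type="xs:anyURI" use="required"/>
      <xs:attributeGroup ref="xhtml.img.csim.attlist"/>
    </xs:complexType>
  </xs:element>

</xs:schema>
```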

2.11. A missing scenario

     One important scenario that seems to be missing is just plonking
     bits of the XHTML namespace into specific places in some other
     namespace. Maybe it's too obvious/easy, but it is actually the most
     common scenario. e.g. MyOwnLanguage has its own things, and I'll
     just put some XHTML inline elements here.

     Introducing XHTML elements into the xsd:documentation elements in a
     schema document is another instance of the scenario.

We have a concept of 'integration sets' which allow this usage. What we  
will do is add an example to the spec to show how to do this, to make it  
clearer.
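
In the meantime, the scenario might be sketched as follows. The target
namespace, the element name "note", and the schema location
xhtml-inline-subset.xsd are all hypothetical; the imported schema is
assumed to declare em, strong and code in the XHTML namespace.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:xhtml="http://www.w3.org/1999/xhtml"
           targetNamespace="http://example.org/my-own-language"
           elementFormDefault="qualified">

  <!-- Hypothetical location of a schema for the XHTML elements used -->
  <xs:import namespace="http://www.w3.org/1999/xhtml"
             schemaLocation="xhtml-inline-subset.xsd"/>

  <!-- An element of the host language allowing text mixed with a few
       XHTML inline elements -->
  <xs:element name="note">
    <xs:complexType mixed="true">
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element ref="xhtml:em"/>
        <xs:element ref="xhtml:strong"/>
        <xs:element ref="xhtml:code"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>

</xs:schema>
```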

3.1. Make the introduction less DTD-specific

This should be much better now.

3.2. The term PCDATA

Fixed.

3.5. Shape type

     Shouldn't the overview in section 4.3 say that Shape has just the
     four values rect, circle, poly, and default?

Yes, it should, and will.

3.6. White space in the document source

Thanks. We will do a clean up prior to publication.

4.1. Testing the schema documents
[...]
     [Later information from Shane McCarron is that this spec doesn't
     provide a driver, but that
     <URL:http://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd> might be
     consulted as an example. To be followed up ...]

Indeed, the Modularization spec doesn't include any drivers. We have added  
an informative link to one.

4.2. Where is the html element?

     Where is the html element defined?

It is in the structure module.

     (And, for the instruction of those seeking to understand
     how to use these modules, a pointer to the XHTML 1.1 driver modules
     would be very useful.)

Done.

     But the issue appears to at least some readers as at least partly
     substantive: that is, it seems to us that a specification describing
     a modular definition of the XHTML 1.1 vocabulary ought, in the
     nature of things, to include a top-level driver module which calls
     in all the others.

Coming from a group that didn't include a mechanism to specify what the  
root element is, I am shocked!
But seriously, this is Modularization 1.1, not the modularization of XHTML
1.1. Modularization 1.1 is and will be used by many different languages.  
(See for instance
    http://www.w3.org/MarkUp/Group/2007/xhtml-modularization-11-implementation
)

4.3. Case insensitivity and XML Schema patterns or enumerations
[...]
     Given that many regex libraries already have such flags, such an
     addition wouldn't seem to be difficult for implementors.
     Should the XML Schema Working Group consider such a change?

It would make certain declarations easier to write, and make them actually  
readable.

     And if so, what is to be done about Unicode characters for which the
     upper/lowercase mapping is not 1:1? And what should be done about
     title case?

Ha! You're asking the wrong people...

Thanks for the comments.

Best wishes,

Steven Pemberton
For the XHTML2 Working Group
Received on Wednesday, 22 August 2007 12:20:46 UTC