- From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
- Date: Tue, 27 Feb 2007 15:44:45 -0700
- To: www-html-editor@w3.org
- Cc: "C. M. Sperberg-McQueen" <cmsmcq@acm.org>, Schema IG <w3c-xml-schema-ig@w3.org>
Dear colleagues:
On behalf of the XML Schema Working Group, I congratulate the
HTML Working Group on your progress with XHTML Modularization.
As described in the comments below, owing to a snafu
the XML Schema WG did not review the Last Call WD of XHTML
Modularization 1.1 last summer. In the hopes that the maxim
"better late than never" is true in this case, we transmit
to you now our comments on the document. My apologies for
the snafu.
Our comments are available at any of the URIs
http://www.w3.org/XML/Group/2007/02/m12n-of-xhtml.xsd-comments
http://www.w3.org/XML/Group/2007/02/m12n-of-xhtml.xsd-comments.xml
http://www.w3.org/XML/Group/2007/02/m12n-of-xhtml.xsd-comments.html
A text version is provided below for those who find it more
convenient.
--C. M. Sperberg-McQueen
on behalf of the W3C XML Schema WG
Notes on
XHTML Modularization 1.1
Ed. by
C. M. Sperberg-McQueen
Submitted to the HTML Working Group on behalf of the XML Schema Working
Group
27 February 2007
$Id: m12n-of-xhtml.xsd-comments.html,v 1.1 2007/02/27 22:36:18 cmsmcq
Exp $
_________________________________________________________
* 1. Background
* 2. Substantive comments
+ 2.1. Charset type
+ 2.2. Color type
+ 2.3. ContentType
+ 2.4. Coords type
+ 2.5. FPI type
+ 2.6. FrameTarget type
+ 2.7. LinkTypes type
+ 2.8. Tightening other types
+ 2.9. Named model groups vs. substitution groups
+ 2.10. Adding attributes
+ 2.11. A missing scenario
* 3. Editorial comments
+ 3.1. Make the introduction less DTD-specific
+ 3.2. The term PCDATA
+ 3.3. Section 4.3 Attribute Types
+ 3.4. Length type: well done
+ 3.5. Shape type
+ 3.6. White space in the document source
* 4. Comments half substantive and half editorial
+ 4.1. Testing the schema documents
+ 4.2. Where is the html element?
+ 4.3. Case insensitivity and XML Schema patterns or enumerations
_________________________________________________________
NOTE:
This document contains comments on the [31]Last Call Working Draft
of XHTML™ Modularization 1.1. Several different readers formulated
the comments; the editor has not attempted to unify and organize
them strictly. The comments are forwarded to the XHTML Working Group
on behalf of the XML Schema Working Group, but it should be noted
that the XML Schema Working Group has not had the leisure to
consider them in detail.
The Last Call comment period on this draft ended 4 August 2006, so
these comments are very late. They are being forwarded nonetheless
in the hopes that even at this late date they may prove useful to
those responsible for the XHTML Modularization spec.
To minimize wasted effort, the copy actually consulted is the
[32]editor's copy of 19 February 2007.
[31] http://www.w3.org/TR/2006/WD-xhtml-modularization-20060705
[32] http://www.w3.org/MarkUp/Group/2007/WD-xhtml-
modularization-20070219/introduction.html
1. Background
Owing apparently to human error, the XML Schema Working Group failed
to attend to the publication of the Last Call draft of [33]XHTML
Modularization 1.1, and consequently failed to review the spec
during the scheduled last-call comment period.
We apologize for this oversight; our chair has administered severe
counseling to our staff contact, and our staff contact has promised
he will endeavor not to make similar mistakes in future.
Since HTML and XHTML constitute by far the most widely used
vocabularies published by any W3C Working Group, the Schema Working
Group has a deep interest in making sure the formulations of XHTML
using XML Schema are as useful as possible.
The following comments have been prepared in haste, in an attempt to
perform as useful a review as possible.
The Schema Working Group's previous comments (apparently on the
[34]Last Call draft of 9 December 2002) are at
<URL:[35]http://www.w3.org/XML/Group/2003/01/xmlschema-notes-on-xhtm
l-modularization.html> and were transmitted to the HTML WG in
<URL:[36]http://lists.w3.org/Archives/Public/www-html-editor/2003Jan
Mar/0043.html> and
<URL:[37]http://lists.w3.org/Archives/Member/w3c-xml-schema-ig/2003J
an/0099.html>.
A quick summary of the earlier comments:
1. Please use the appropriate simple types.
2. Exploit substitution groups.
3. Explain what to do about multiple schemas for same namespace.
4. Don't declare everything blocked and final!
5. Sec 2.2.6 is opaque.
6. Point to external documentation.
7. Provide internal documentation.
8. Clarify conformance.
9. More concrete extension scenarios.
10. Exhibit structure of schema better.
[33] http://www.w3.org/TR/2006/WD-xhtml-modularization-20060705
[34] http://www.w3.org/TR/2002/WD-xhtml-m12n-schema-20021209/
[35] http://www.w3.org/XML/Group/2003/01/xmlschema-notes-on-
xhtml-modularization.html
[36] http://lists.w3.org/Archives/Public/www-html-editor/
2003JanMar/0043.html
[37] http://lists.w3.org/Archives/Member/w3c-xml-schema-ig/
2003Jan/0099.html
It appears that the current document addresses a number of these
comments very directly; others less so or not at all.
The XML Schema Working Group appears not to have reviewed or sent
comments on the later working drafts of [38]3 October 2003 or [39]13
February 2006.
[38] http://www.w3.org/TR/2003/WD-xhtml-m12n-schema-20031003/
[39] http://www.w3.org/TR/2006/PR-xhtml-modularization-20060213/
2. Substantive comments
The following comments are substantive in the sense that they
propose changes which would affect the validity of some documents in
the XHTML family. Whether they are substantive in the sense that
they would invalidate existing reviews of the Modularization
document, we leave to others to decide.
2.1. Charset type
Charset is defined as a vacuous restriction of xsd:string. That may
be the right thing to do, but it seems likely that a better
definition can be formulated. First, RFC 2045 defines charset values
as either tokens or quoted-strings; it defines token as containing
only ASCII characters and it seems to take over the definition of
quoted-string from RFC 822, which defines quoted-string as containing
only ASCII characters. So a better definition of Charset might be
<xsd:simpleType name="Other-Charset-identifier">
<xsd:annotation>
<xsd:documentation>
<div xmlns="http://www.w3.org/1999/xhtml">
<p>Charset values predefined by RFC 2046. The RFC
restricts these values to ASCII characters,
i.e. those in the Unicode BasicLatin block.</p>
</div>
</xsd:documentation>
</xsd:annotation>
<xsd:restriction base="xsd:string">
<xsd:pattern value="\p{IsBasicLatin}+">
</xsd:pattern>
</xsd:restriction>
</xsd:simpleType>
The IANA registry seems to say that in fact charset identifiers are
limited to 40 characters, but it's not clear whether that rule is
intended by the XHTML spec to be binding on Charset values in HTML
documents.
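If the 40-character limit were meant to be binding, it could be folded into the fallback type with a maxLength facet. The following is only a sketch: the type name Other-Charset-identifier-40 is hypothetical, and whether XHTML intends the limit to apply is exactly the open question raised above.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical variant of Other-Charset-identifier: one or more
       ASCII characters, at most 40 in total. -->
  <xsd:simpleType name="Other-Charset-identifier-40">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="\p{IsBasicLatin}+"/>
      <xsd:maxLength value="40"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
```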
Another point is that it might be more helpful for readers (and
possibly implementors) to define the type in such a way as to
identify at least some of the well-known identifiers which user
agents should recognize — e.g. those mentioned in RFC 2046 — as well
as others. One way to do this would be to define a type listing the
charset values identified in RFC 2046, and then define a union of
that type with xsd:string. The well-known charset values can be
enumerated:
<xsd:simpleType name="RFC2046-Predefined-charsets">
<xsd:annotation>
<xsd:documentation>
<div xmlns="http://www.w3.org/1999/xhtml">
<p>Charset values predefined by RFC 2046. Other
values are also accepted as charset values.</p>
</div>
</xsd:documentation>
</xsd:annotation>
<xsd:restriction base="xsd:string">
<xsd:enumeration value="US-ASCII">
<xsd:annotation>
<xsd:documentation>As defined in ANSI X3.4-1986.</xsd:documentation>
</xsd:annotation>
</xsd:enumeration>
<xsd:enumeration value="ISO-8859-1"/>
<xsd:enumeration value="ISO-8859-2"/>
<xsd:enumeration value="ISO-8859-3"/>
<xsd:enumeration value="ISO-8859-4"/>
<xsd:enumeration value="ISO-8859-5"/>
<xsd:enumeration value="ISO-8859-6"/>
<xsd:enumeration value="ISO-8859-7"/>
<xsd:enumeration value="ISO-8859-8"/>
<xsd:enumeration value="ISO-8859-9"/>
<xsd:enumeration value="ISO-8859-10"/>
</xsd:restriction>
</xsd:simpleType>
The problem with this is that the RFCs define charset values as
case-insensitive. So probably a better way to define the well known
charset values would be with patterns:
<xsd:simpleType name="RFC2046-Predefined-charsets">
<xsd:annotation>
<xsd:documentation>
<div xmlns="http://www.w3.org/1999/xhtml">
<p>Charset values predefined by RFC 2046. Other
values are also accepted.</p>
</div>
</xsd:documentation>
</xsd:annotation>
<xsd:restriction base="xsd:string">
<xsd:whiteSpace value="collapse"/>
<xsd:pattern value="[Uu][Ss]-[Aa][Ss][Cc][Ii][Ii]">
<xsd:annotation>
<xsd:documentation>As defined in ANSI X3.4-1986.</xsd:documentation>
</xsd:annotation>
</xsd:pattern>
<xsd:pattern value="[Ii][Ss][Oo]-8859-(10|[1-9])">
<xsd:annotation>
<xsd:documentation>ISO-8859 parts 1-10.</xsd:documentation>
</xsd:annotation>
</xsd:pattern>
</xsd:restriction>
</xsd:simpleType>
The actual definition of Charset could usefully be a union of these
two:
<xsd:simpleType name="Charset">
<xsd:annotation>
<xsd:documentation>
<div xmlns="http://www.w3.org/1999/xhtml">
<p>Charset values. Accept values predefined by RFC 2046,
and also other values.</p>
</div>
</xsd:documentation>
</xsd:annotation>
<xsd:union memberTypes="
xh11d:RFC2046-Predefined-charsets
xh11d:Other-Charset-identifier
">
</xsd:union>
</xsd:simpleType>
A more ambitious definition might mention all of the values in the
IANA type registry, but the result, when examined, is rather long
and not really very informative — rather like the registry itself
— and it is not included here.
2.2. Color type
Two things seem puzzling in the current definition of Color: (1) it
allows any NMTOKEN, rather than just the sixteen well known color
names. And (2) while six-digit hexadecimal values are allowed,
three-digit values are not allowed. (The description of Color in
HTML 4.01 (<URL:[40]http://www.w3.org/TR/html401/types.html#h-6.5>)
doesn't actually specify how many digits are to be used for hex
color values.)
If these properties are unintentional, a type that identifies the
well-known names and allows three-digit hex values may be better:
<!-- sixteen color names or RGB color expression-->
<xsd:simpleType name="Color">
<xsd:union>
<xsd:simpleType>
<!--* Known color names are case-insensitive *-->
<xsd:restriction base="xsd:NMTOKEN">
<xsd:pattern value="[Bb][Ll][Aa][Cc][Kk]"/>
<xsd:pattern value="[Gg][Rr][Ee][Ee][Nn]"/>
<xsd:pattern value="[Ss][Ii][Ll][Vv][Ee][Rr]"/>
<xsd:pattern value="[Ll][Ii][Mm][Ee]"/>
<xsd:pattern value="[Gg][Rr][Aa][Yy]"/>
<xsd:pattern value="[Oo][Ll][Ii][Vv][Ee]"/>
<xsd:pattern value="[Ww][Hh][Ii][Tt][Ee]"/>
<xsd:pattern value="[Yy][Ee][Ll][Ll][Oo][Ww]"/>
<xsd:pattern value="[Mm][Aa][Rr][Oo][Oo][Nn]"/>
<xsd:pattern value="[Nn][Aa][Vv][Yy]"/>
<xsd:pattern value="[Rr][Ee][Dd]"/>
<xsd:pattern value="[Bb][Ll][Uu][Ee]"/>
<xsd:pattern value="[Pp][Uu][Rr][Pp][Ll][Ee]"/>
<xsd:pattern value="[Tt][Ee][Aa][Ll]"/>
<xsd:pattern value="[Ff][Uu][Cc][Hh][Ss][Ii][Aa]"/>
<xsd:pattern value="[Aa][Qq][Uu][Aa]"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType>
<!--* Other colors are expressed using a hash mark plus a
* three- or six-digit hexadecimal number *-->
<xsd:restriction base="xsd:token">
<xsd:pattern value="#[0-9a-fA-F]{3}([0-9a-fA-F]{3})?"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:union>
</xsd:simpleType>
[40] http://www.w3.org/TR/html401/types.html#h-6.5
If it's desired to allow other NMTOKEN values to count as valid, as
well as the sixteen named by HTML 4.01 (e.g. for the system colors
allowed by CSS2
<URL:[41]http://www.w3.org/TR/REC-CSS2/syndata.html#value-def-color
>), then inserting
<xsd:simpleType>
<xsd:restriction base="xsd:NMTOKEN"/>
</xsd:simpleType>
[41] http://www.w3.org/TR/REC-CSS2/syndata.html#value-def-color
as a final union member would do that. (Since the system colors of
CSS2 appear to be a finite enumerated list, they could be defined in
the same way as the sixteen names in HTML 4.01, although for clarity
they should probably go into a different member type. That's left as
an exercise for the reader.)
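For what it is worth, a separate member type for the CSS2 system colors might start out as follows. This sketch is deliberately abbreviated to three of the CSS2 names; the type name SystemColors is hypothetical, and the remaining names would follow the same shape.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical member type; only three of the CSS2 system
       color names are shown here. -->
  <xsd:simpleType name="SystemColors">
    <xsd:restriction base="xsd:NMTOKEN">
      <xsd:pattern value="[Bb][Uu][Tt][Tt][Oo][Nn][Ff][Aa][Cc][Ee]"/>
      <xsd:pattern value="[Ww][Ii][Nn][Dd][Oo][Ww]"/>
      <xsd:pattern value="[Ww][Ii][Nn][Dd][Oo][Ww][Tt][Ee][Xx][Tt]"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
```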
2.3. ContentType
Like Charset, this could be defined as a union whose first member(s)
recognize well-known values defined by the RFCs or in the IANA
registry and whose final type (here xsd:string) takes care of
extensibility. It's not clear to me whether the values are in fact
limited by the RFC to ASCII characters; if so, xsd:string is a bit
too broad.
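By analogy with Charset, a union for ContentType might be sketched as follows. The type name RFC2046-Top-level-types is hypothetical, the namespace bound to the xh11d prefix is our assumption about the datatypes module, and the patterns check only the top-level media type, not the full RFC 2045 token grammar.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xh11d="http://www.w3.org/1999/xhtml/datatypes/"
            targetNamespace="http://www.w3.org/1999/xhtml/datatypes/">
  <!-- Values whose top-level type is one of those named in RFC 2046,
       matched case-insensitively; the subtype is left unconstrained. -->
  <xsd:simpleType name="RFC2046-Top-level-types">
    <xsd:restriction base="xsd:token">
      <xsd:pattern value="[Tt][Ee][Xx][Tt]/.+"/>
      <xsd:pattern value="[Ii][Mm][Aa][Gg][Ee]/.+"/>
      <xsd:pattern value="[Aa][Uu][Dd][Ii][Oo]/.+"/>
      <xsd:pattern value="[Vv][Ii][Dd][Ee][Oo]/.+"/>
      <xsd:pattern value="[Aa][Pp][Pp][Ll][Ii][Cc][Aa][Tt][Ii][Oo][Nn]/.+"/>
      <xsd:pattern value="[Mm][Uu][Ll][Tt][Ii][Pp][Aa][Rr][Tt]/.+"/>
      <xsd:pattern value="[Mm][Ee][Ss][Ss][Aa][Gg][Ee]/.+"/>
    </xsd:restriction>
  </xsd:simpleType>
  <!-- The final member keeps the type open to other values. -->
  <xsd:simpleType name="ContentType">
    <xsd:union memberTypes="xh11d:RFC2046-Top-level-types xsd:string"/>
  </xsd:simpleType>
</xsd:schema>
```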
2.4. Coords type
Since the possible values of Coords values are so clearly specified
in the spec, it seems a shame not to define the type a little more
tightly. The absence of macros in XML Schema regular expressions
makes life a little harder, but one reason XML Schema doesn't need
macros in regexes is that we can use general entities. If we write
the following entity declarations into the internal subset of the
schema document, we have general entities which correspond to the
important bits of coordinate strings, as defined in HTML
(<URL:[42]http://www.w3.org/TR/html401/struct/objects.html#adef-coor
ds>):
<!ENTITY Pixel "\d+">
<!ENTITY Percent "(\d+[&#37;]|\d*\.\d+[&#37;])">
<!ENTITY Length "(&Pixel;|&Percent;)">
<!ENTITY Comma "\s*,\s*">
<!ENTITY Pair "&Length;&Comma;&Length;">
[42] http://www.w3.org/TR/html401/struct/objects.html#adef-coords
That allows the declarations to be fairly clear about their
structure:
<xsd:simpleType name="Coords.rect">
<xsd:restriction base="xsd:token">
<xsd:pattern value="(&Length;&Comma;){3}(&Length;)"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="Coords.circle">
<xsd:restriction base="xsd:token">
<xsd:pattern value="(&Length;&Comma;){2}(&Length;)"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="Coords.poly">
<xsd:restriction base="xsd:token">
<xsd:pattern value="(&Pair;&Comma;){2,}(&Pair;)"/>
</xsd:restriction>
</xsd:simpleType>
If they prove to cause trouble for any schema processors, of course,
the entity references can be expanded.
And the Coords type can be clear that what is expected is either the
coordinates for a rectangle, or those for a circle, or those for a
polygon. (Type-aware systems can use the information about which
member type in the union actually accepted the value to perform a
sanity check: if the coords attribute has type Coords.rect, then the
value of the shape attribute had better be 'rect', and vice versa.)
<xsd:simpleType name="Coords">
<xsd:union memberTypes="
xh11d:Coords.rect
xh11d:Coords.circle
xh11d:Coords.poly">
</xsd:union>
</xsd:simpleType>
2.5. FPI type
ISO 8879 appears to define the formal public identifier using a
regular language, which means it's not necessary to allow any
xsd:normalizedString value. (The formalization below assumes that
only unregistered owner identifiers are to be used, since section
3.6 of this spec says the value must begin with '-'.) Building it up
gradually using entities, one can write:
<!ENTITY minimum-data "[ a-zA-Z0-9'()+,\-./:=?]*">
<!ENTITY owner-id "&minimum-data;">
<!ENTITY textclass1 "(DTD|ELEMENTS|ENTITIES|NOTATION|TEXT)">
<!ENTITY textclass2 "(CAPACITY|CHARSET|DOCUMENT|LPD|NONSGML|SHORTREF|SUBDOC|SYNTAX)">
<!ENTITY textclass "(&textclass1;|&textclass2;)">
It's not clear that any of the names in textclass2 make any sense
whatever for modules intended for use in the XHTML family, so one
might choose to omit them.
<!ENTITY langname "(\i\c*)">
<!ENTITY designator "&minimum-data;">
<!ENTITY lang-or-des "(&langname;|&designator;)">
<!ENTITY display "&minimum-data;">
<!ENTITY textdesc "&minimum-data;">
<!ENTITY textid "&textclass; (-//)?&textdesc;//&lang-or-des;(//&display;)?">
<!ENTITY fpi "-//&owner-id;//&textid;">
The pattern is then quite simple:
<xsd:simpleType name="FPI">
<xsd:restriction base="xsd:normalizedString">
<xsd:pattern value="&fpi;"/>
</xsd:restriction>
</xsd:simpleType>
2.6. FrameTarget type
The HTML spec
(<URL:[43]http://www.w3.org/TR/html401/types.html#h-6.16>) seems to
want a slightly tighter definition of frame target names. Perhaps
something like the following should be used.
<xsd:simpleType name="FrameTarget">
<xsd:union>
<xsd:simpleType>
<xsd:restriction base="xsd:NMTOKEN">
<xsd:enumeration value="_blank"/>
<xsd:enumeration value="_self"/>
<xsd:enumeration value="_parent"/>
<xsd:enumeration value="_top"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType>
<xsd:restriction base="xsd:string">
<xsd:pattern value="[a-zA-Z].*"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:union>
</xsd:simpleType>
[43] http://www.w3.org/TR/html401/types.html#h-6.16
2.7. LinkTypes type
LinkTypes is a good example of a type with what is sometimes called
a ‘semi-open’ list of values. Some set of well-known values is
defined, which software is encouraged to recognize and which authors
are encouraged to use when appropriate, but for strict validity, a
much larger set of values is allowed.
In such cases, it's good practice to document the recognized types
in the type definition. Since the well known values here are case
insensitive, that's best done with a list of patterns rather than
with an enumeration:
<xsd:simpleType name="KnownLinkTypes">
<xsd:restriction base="xsd:NMTOKEN">
<xsd:pattern value="[Aa][Ll][Tt][Ee][Rr][Nn][Aa][Tt][Ee]"/>
<xsd:pattern value="[Ss][Tt][Yy][Ll][Ee][Ss][Hh][Ee][Ee][Tt]"/>
<xsd:pattern value="[Ss][Tt][Aa][Rr][Tt]"/>
<xsd:pattern value="[Nn][Ee][Xx][Tt]"/>
<xsd:pattern value="[Pp][Rr][Ee][Vv]"/>
<xsd:pattern value="[Cc][Oo][Nn][Tt][Ee][Nn][Tt][Ss]"/>
<xsd:pattern value="[Ii][Nn][Dd][Ee][Xx]"/>
<xsd:pattern value="[Gg][Ll][Oo][Ss][Ss][Aa][Rr][Yy]"/>
<xsd:pattern value="[Cc][Oo][Pp][Yy][Rr][Ii][Gg][Hh][Tt]"/>
<xsd:pattern value="[Cc][Hh][Aa][Pp][Tt][Ee][Rr]"/>
<xsd:pattern value="[Ss][Ee][Cc][Tt][Ii][Oo][Nn]"/>
<xsd:pattern value="[Ss][Uu][Bb][Ss][Ee][Cc][Tt][Ii][Oo][Nn]"/>
<xsd:pattern value="[Aa][Pp][Pp][Ee][Nn][Dd][Ii][Xx]"/>
<xsd:pattern value="[Hh][Ee][Ll][Pp]"/>
<xsd:pattern value="[Bb][Oo][Oo][Kk][Mm][Aa][Rr][Kk]"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="LinkTypes">
<xsd:union memberTypes="xh11d:KnownLinkTypes xsd:NMTOKEN"/>
</xsd:simpleType>
2.8. Tightening other types
If we continue in the same way, we risk belaboring our point past
reason. So instead of commenting in detail on individual types which
could, it seems to us, usefully be made more restrictive, or more
informative, or both, by means of enumerations or patterns to
recognize well known values or unions to combine subtypes (including
more and less restrictive definitions of a datatype), we will merely
say that we believe other types should also be given definitions
closer to the requirements of the prose. (MultiLength, for example,
is not really that hard to capture with a pattern.)
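As an illustration of the parenthetical remark, MultiLength might be captured roughly as follows. This is a sketch based on our reading of the HTML 4.01 prose (pixels, a percentage, or a relative length of the form "i*"), not a tested replacement for the current definition.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:simpleType name="MultiLength">
    <xsd:restriction base="xsd:token">
      <!-- pixels, e.g. 200 -->
      <xsd:pattern value="\d+"/>
      <!-- percentage, e.g. 50% or .5% -->
      <xsd:pattern value="\d+%|\d*\.\d+%"/>
      <!-- relative length, e.g. * or 2* -->
      <xsd:pattern value="\d*\*"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
```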
2.9. Named model groups vs. substitution groups
We reiterate our advice of four years ago: the definition of the
XHTML vocabulary would be easier to follow, and it would be easier
to extend it, if the schema documents used substitution groups
wherever feasible.
If you have had specific problems applying substitution groups to
XHTML, we would very much like to know what they were; we can
speculate, but would prefer to hear from you.
Using named model groups for extensibility has a number of
unfortunate side effects. For example, the schema includes this
definition:
<xs:group
name="xhtml.title.content">
<xs:sequence/>
</xs:group>
What's the point of that, exactly? Presumably the idea is to play a
similar trick to what you did when this was a DTD and splice your
own stuff in there from your own namespace. But how does using a
group get you there? It's not impossible, but it is harder than
necessary and you could just as easily redefine the element in
question directly. So defining all these content groups just gums
up the schema and makes it harder to read. (Those accustomed to
DTD-based extension of vocabularies may have little trouble
following the logic here, but that group may no longer be as large
as it once was.)
If a user wants to use XHTML and just add one little inline element
or allow some new content in, say, the title element, the user has
to jump through a few unnecessary hoops.
This scenario could be better enabled even within the existing
architecture just by adding an abstract substitution group head as a
choice to all the named model groups.
So even if you don't restructure the schema documents to use
substitution groups wherever possible, you could simplify
extensibility for users of the spec a great deal by just adding an
abstract element to each group, or each content model where
extensibility is an obvious requirement, to provide hooks for later
schema authors.
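A minimal sketch of such a hook, applied to the title content group quoted above, might look like this. The element name title.extension is hypothetical, and the sketch assumes the group is declared in the XHTML namespace.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://www.w3.org/1999/xhtml"
           targetNamespace="http://www.w3.org/1999/xhtml"
           elementFormDefault="qualified">
  <!-- Hypothetical abstract head: never appears in documents itself,
       but any element substitutable for it may. -->
  <xs:element name="title.extension" abstract="true"/>
  <xs:group name="xhtml.title.content">
    <xs:sequence>
      <xs:element ref="title.extension"
                  minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:group>
</xs:schema>
```

An extension schema could then declare an element in its own namespace with substitutionGroup set to this head and have it accepted inside title, without redefining anything in the XHTML modules.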
2.10. Adding attributes
It's not clear that the way modules add attributes works. For
example, the client side image map module adds attributes to the img
element. All well and good, but looking at the schema I see an
attribute group defined:
<!-- modify img attribute definition list -->
<xs:attributeGroup name="xhtml.img.csim.attlist">
<xs:attribute name="usemap" type="xs:IDREF"/>
</xs:attributeGroup>
I can't see where this actually is used anywhere in the schema. I
think what the module should be doing is a redefine of the groups.
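A redefine along the following lines is what we have in mind. It is a sketch only: the attribute group name xhtml.img.attlist and the file name xhtml-image-1.xsd are our assumptions about the image module and should be checked against the actual schema documents.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://www.w3.org/1999/xhtml"
           targetNamespace="http://www.w3.org/1999/xhtml">
  <!-- Assumed location of the module that declares the img attributes. -->
  <xs:redefine schemaLocation="xhtml-image-1.xsd">
    <xs:attributeGroup name="xhtml.img.attlist">
      <!-- The self-reference pulls in the original attributes. -->
      <xs:attributeGroup ref="xhtml.img.attlist"/>
      <!-- The attribute added by the client-side image map module. -->
      <xs:attribute name="usemap" type="xs:IDREF"/>
    </xs:attributeGroup>
  </xs:redefine>
</xs:schema>
```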
2.11. A missing scenario
One important scenario that seems to be missing is just plonking
bits of the XHTML namespace into specific places in some other
namespace. Maybe it's too obvious/easy, but it is actually the most
common scenario. e.g. MyOwnLanguage has its own things, and I'll
just put some XHTML inline elements here.
Introducing XHTML elements into the xsd:documentation elements in a
schema document is another instance of the scenario.
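To make the scenario concrete, a schema for some other language might import the XHTML namespace and refer to a few inline elements directly. In this sketch the namespace http://example.org/MyOwnLanguage and the element name note are hypothetical, and it assumes that em, strong and code are declared as global elements in the imported schema.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:xhtml="http://www.w3.org/1999/xhtml"
           targetNamespace="http://example.org/MyOwnLanguage"
           elementFormDefault="qualified">
  <xs:import namespace="http://www.w3.org/1999/xhtml"
             schemaLocation="http://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd"/>
  <!-- A hypothetical element whose mixed content allows a few
       XHTML inline elements alongside text. -->
  <xs:element name="note">
    <xs:complexType mixed="true">
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element ref="xhtml:em"/>
        <xs:element ref="xhtml:strong"/>
        <xs:element ref="xhtml:code"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>
```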
3. Editorial comments
The following comments are editorial; we hope that they can be made
without invalidating any existing reviews of the specification.
3.1. Make the introduction less DTD-specific
Section 1 Introduction
<URL:[44]http://www.w3.org/TR/xhtml-modularization/introduction.html
> also
<URL:[45]http://www.w3.org/MarkUp/Group/2007/WD-xhtml-modularization
-20070219/introduction.html>
sec 1.2 para 1: "These abstract modules are implemented in this
specification using the XML Document Type Definition language, but
an implementation using XML Schemas is expected." Read "These
abstract modules are implemented in this specification using both
the XML Document Type Definition language and XML Schema 1.0."?
sec 1.3.4 para 2:
[44] http://www.w3.org/TR/xhtml-modularization/introduction.html
[45] http://www.w3.org/MarkUp/Group/2007/WD-xhtml-
modularization-20070219/introduction.html
A document is an instance of one particular document type defined
by the DTD identified in the document's prologue. Validating the
document is the process of checking that the document complies
with the rules in the document type definition.
Here (as elsewhere) there are traces of DTD-only terminology. Some
SGML experts maintain that the term "document type definition" of
ISO 8879 and XML is defined broadly enough to include schemas
defined with XSD or with any other language currently known to
information technology — on that reading, the only problem with the
paragraph just quoted is the assumption that the document and its
DTD are associated in the document's prologue.
Normal usage, however, uses the term "document type definition" with
narrower scope nowadays, to mean only those schemas written using
the bracket-bang keyword syntax of ISO 8879 and the XML spec. On
that reading, there are several things in this paragraph that apply
only to conventional XML DTDs, not to schemas in general:
In fact, any document is an instance of an infinite number of
document types and schemas (or document type definitions), just as
any object is contained by an infinite number of sets. This fact
does not conflict with the equally important fact that an author may
wish to advertise conformance to a particular schema or affiliation
with a particular document type, either for the sake of tool support
or for other reasons.
Documents may be associated with a schema by their prolog, or by
xsi:schemaLocation hints in the document instance, or by out-of-band
associations between document and schema (e.g. by parameters passed
to the validator at invocation time).
Validation is the process of checking whether, not the process of
ensuring that, a document complies with the rules in the document
type definition.
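For concreteness, association by xsi:schemaLocation might look like the following in a document instance. This is an illustrative sketch; the schema location shown is the XHTML 1.1 driver mentioned elsewhere in these comments, and a validator is free to ignore the hint.

```xml
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/1999/xhtml
                          http://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd">
  <head><title>Example</title></head>
  <body><p>No DOCTYPE declaration is needed for this association.</p></body>
</html>
```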
To make this paragraph cover the current situation (where you're
providing normative XSD schema documents as well as normative DTDs),
you might consider saying something like the following. If you're
willing to adopt the term "schema" as the general term for a formal
machine-readable expression of the rules for a document type, then:
A document may be associated with a particular document type
defined by a schema. The document's prolog may identify a DTD, or
xsi:schemaLocation attributes may be used to associate the
document with a schema written in XML Schema 1.0, or the document
may be associated with a schema by other means (e.g.
validation-time identification of the schema by means of a
parameter passed to a validator). Validating the document is the
process of testing whether the document complies with the rules in
the schema.
Or if you'd prefer to stay with "document type definition", you
could write:
A document may be associated with a particular document type. The
document's prolog may identify a DTD, or xsi:schemaLocation
attributes may be used to associate the document with a document
type definition written in XML Schema 1.0, or the document may be
associated with a document type definition by other means (e.g. a
parameter passed to a validator). Validating the document is the
process of testing whether the document complies with the rules in
the document type definition.
If you stick with "document type definition", you might want to add
something to the definition of "document type definition" in the
glossary, e.g. by changing the sentence:
The same markup model may be expressed by a variety of DTDs.
to something like
The same markup model may be expressed by a variety of document
type definitions, written in a variety of languages, such as the
DTD notation of XML or XML Schema 1.0.
just to make explicit somewhere that you're using "document type
definition" to cover rules written in a variety of languages. You
could mention Relax NG and/or Schematron, too, if you wish.
3.2. The term PCDATA
Section 4.2
<URL:[46]http://www.w3.org/MarkUp/Group/2007/WD-xhtml-modularization
-20070219/abstraction.html>
4.2 para 1 reads in part
[46] http://www.w3.org/MarkUp/Group/2007/WD-xhtml-
modularization-20070219/abstraction.html
... In these cases, the symbol used for text is PCDATA (processed
characted data). This is a term, defined in the XML 1.0
Recommendation, that refers to processed character data. ...
Strictly speaking, XML 1.0 doesn't define the term; it only says
The keyword #PCDATA derives historically from the term "parsed
character data."
(Note also the typo 'characted' for 'character'.)
We'd suggest rewording to say something like
... In these cases, the symbol used for text is PCDATA; this is
short for "parsed character data", denoting sequences of
characters which are to be parsed for markup by an XML processor.
...
3.3. Section 4.3 Attribute Types
Congratulations to the editors; this section is much easier to read
and follow than is sometimes the case when specs define (or fail to
define) fundamental types used throughout them.
Some comments on the definitions of some of the datatypes, as found
in
<URL:[47]http://www.w3.org/TR/xhtml-modularization/SCHEMA/xhtml-data
types-1.xsd> and other schema documents, may be found elsewhere.
[47] http://www.w3.org/TR/xhtml-modularization/SCHEMA/xhtml-
datatypes-1.xsd
3.4. Length type: well done
The definition for Length seems well done. Good work!
3.5. Shape type
Shouldn't the overview in section 4.3 say that Shape has just the
four values rect, circle, poly, and default?
3.6. White space in the document source
Minor but extremely irritating:
<URL:[48]http://www.w3.org/MarkUp/Group/2007/WD-xhtml-modularization
-20070219/schema_module_defs.html#a_smodule_Text>
<URL:[49]http://www.w3.org/MarkUp/Group/2007/WD-xhtml-modularization
-20070219/schema_module_defs.html#a_smodule_Presentation> (and
presumably others) have the tabbing alignment in the schema messed
up, making it harder to read.
[48] http://www.w3.org/MarkUp/Group/2007/WD-xhtml-
modularization-20070219/schema_module_defs.html#a_smodule_Text
[49] http://www.w3.org/MarkUp/Group/2007/WD-xhtml-
modularization-20070219/schema_module_defs.html#a_smodule_Presentation
4. Comments half substantive and half editorial
The following comments may be regarded as purely editorial, or they
may be regarded as substantive; we leave that judgment to you.
4.1. Testing the schema documents
We endeavored to test the schema documents for syntax errors or
other problems, but encountered some difficulty knowing where to
start. Which file(s) should be used as the top-level driver file(s)?
One test reported:
I'm using files extracted from
<URL:[50]http://www.w3.org/TR/xhtml-modularization/xhtml-modularizat
ion.zip>.
[50] http://www.w3.org/TR/xhtml-modularization/xhtml-
modularization.zip
xhtml-framework-1.xsd seems to be the root (the first one mentioned
in Appendix C). But it won't compile (missing many att-groups like
"xhtml.Core.extra.attrib" and "xhtml.I18n.extra.attrib"). I can't
tell whether this is an error or users of these schemas must provide
definitions of those att-groups. (Looks like the latter, because one
of the examples myml-model-1.xsd defines those missing groups.)
I was hoping testing.xml can be a little more helpful, but
unfortunately it refers to
<URL:[51]file:/C:/cygwin/home/ahby/htmlwg/xhtml-modularization/SCHEM
A/xhtml11.xsd>
I really hope I can't access someone else's "file:/C:/"
xhtml11.xsd doesn't exist anywhere.
[51] file://localhost/C:/cygwin/home/ahby/htmlwg/xhtml-
modularization/SCHEMA/xhtml11.xsd
So I gave up on that. Then I looked in the examples directory.
"simpleml-1_0.xsd" doesn't refer to anything like "../". It
redefines "xhtml.Misc.class" in
http://www.w3.org/MarkUp/SCHEMA/xhtml-basic10.xsd. But Xerces-J
fails to locate that group in the schema being redefined. (I found a
Misc.class, but nothing starts with "xhtml.".) I then got many more
errors about missing components. Similar to the ones I got from
xhtml-framework-1.xsd, but different. (Note that these errors are
from schema files in http://www.w3.org/MarkUp/SCHEMA/.)
My last hope was those .html files in examples. Unfortunately
all they gave me was more errors, both in the schema and the
instance.
In summary, I don't know how these files should be used, so I can't
claim that they are broken. No useful input from me ...
[Later information from Shane McCarron is that this spec doesn't
provide a driver, but that
<URL:[52]http://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd> might be
consulted as an example. To be followed up ...]
[52] http://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd
4.2. Where is the html element?
(Possibly related to the preceding.)
Where is the html element defined?
After some searching, starting not from this document but from
<URL:[53]http://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd>, we found a
definition in
<URL:[54]http://www.w3.org/MarkUp/SCHEMA/xhtml11-model-1.xsd>.
This may be solely an editorial issue: the abstract says
[53] http://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd
[54] http://www.w3.org/MarkUp/SCHEMA/xhtml11-model-1.xsd
This modularization provides a means for subsetting and extending
XHTML, a feature needed for extending XHTML's reach onto emerging
platforms. This specification is intended for use by language
designers as they construct new XHTML Family Markup Languages.
and this has led at least some readers to infer that the modules
defined here would include everything needed for a definition of
XHTML 1.1, including the top-level driver files.
If the problem is editorial, the solution is also editorial: the
spec needs to make clear(er) that no top-level driver for XHTML is
provided. (And, for the instruction of those seeking to understand
how to use these modules, a pointer to the XHTML 1.1 driver modules
would be very useful. If such a pointer is already present, then let
this note serve as a record that at least some readers didn't see
the pointer when they needed to.)
But the issue appears to at least some readers as at least partly
substantive: that is, it seems to us that a specification describing
a modular definition of the XHTML 1.1 vocabulary ought, in the
nature of things, to include a top-level driver module which calls
in all the others.
4.3. Case insensitivity and XML Schema patterns or enumerations
Several of the alternative type definitions offered elsewhere in
these comments propose to use patterns (rather than enumerations,
as one might expect) to handle the well known values for types which
have well known values. In the numerous cases in which the values
are defined as case insensitive, the pattern for a
(case-insensitive) value like “black” is written “<xsd:pattern
value="[Bb][Ll][Aa][Cc][Kk]"/>”.
The regularity with which this technique must be used suggests that
perhaps XML Schema should add a caseInsensitive flag to patterns.
This would allow writing the pattern “<xsd:pattern value="black"
caseInsensitive="true"/>” instead.
Given that many regex libraries already have such flags, such an
addition wouldn't seem to be difficult for implementors.
Should the XML Schema Working Group consider such a change?
And if so, what is to be done about Unicode characters for which the
upper/lowercase mapping is not 1:1? And what should be done about
title case?
Received on Tuesday, 27 February 2007 22:45:37 UTC