Better even later than never from Henry S. Thompson on 2000-01-27 (www-xml-schema-comments@w3.org from January to March 2000)

From: Henry S. Thompson <ht@cogsci.ed.ac.uk>
Date: 27 Jan 2000 11:42:43 +0000
To: Robert DuCharme <DuCharmR@moodys.com>
Cc: www-xml-schema-comments@w3.org
Message-ID: <f5bsnzjyc3w.fsf@cogsci.ed.ac.uk>
Your comments [1] on an earlier draft of XML Schema were acknowledged
briefly at the time, but the care and extent of your input deserved
more attention than it (publicly) received.  Herewith our notes on the 
disposition of the issues you raised.  Each note is marked up with a
<response> element with a numeric id.  A <response>Moot</response>
indicates the point has been overtaken by changes in the WD since your 
contribution.  This does not mean the comment was wasted, but rather
that it has been addressed in the context of other changes.

Description

Comments on the 6-May-1999 "XML Schema Part 1" Working Draft
I like a lot of it, but I've limited this message to comments on general
concepts, practices, and choice of terminology that I feel need revision. In my
marked-up hard copy, I have noted typos and suggested revisions for clarity
that would fall into the category of "copyediting" and "basic Strunk and White
stuff" (turning passive sentences into active, etc.) if the WG is interested in
seeing them at this stage. As an example: the single sentence--and what a
sentence--from section 3.5, 
         "No element type is referenced by more than one of the explicit and
         acquired content models (unless two or more acquired models share
         modelElts acquired from a common ancestor, in which case such
         modelElts shall be ignored in all but the first for the purpose of
         constructing the effective model), in which case if the non-vacuous
         explicit and acquired models are all eltOnly the effective model is a
         sequence of all the non-vacuous acquired models, in the order in which
         they are specified in the refinements list, followed by the explicit
         model (if it is non-vacuous), or else if the non-vacuous explicit and
         acquired models are all mixed, the effective model is a mixed whose
         elementTypeRefs and elementTypeDecls are the union of the
         elementTypeRefs and elementTypeDecls of all the non-vacuous explicit
         and acquired models."  

could be revised into the following four sentences over two paragraphs: 
         "No element type is referenced by more than one of the explicit and
         acquired content models (unless two or more acquired models share
         modelElts acquired from a common ancestor, in which case the processor
         shall ignore all but the first modelElt when constructing the
         effective model). There are two ways this can happen: either the
         non-vacuous explicit and acquired models are all eltOnly or they are
         mixed. If they are all eltOnly, the effective model is a sequence of
         all the non-vacuous acquired models, in the order in which the 
         refinements list specifies them, followed by the explicit model if it
         is non-vacuous. If the non-vacuous explicit and acquired models are
         all mixed, the effective model is a mixed content model whose
         elementTypeRefs and elementTypeDecls are the union of the
         elementTypeRefs and elementTypeDecls of all the non-vacuous explicit
         and acquired models."  

My revision may betray a misunderstanding of the original's meaning, but the
general idea of breaking down overlong sentences (134 words!) into multiple
sentences or even bulleted lists would make the Structures document much easier
to understand. 

<response id="1">Specific observation is mooted by changes in the text. General observation
is an editorial request for clarity and simplicity of expression.</response>
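
For readers who find even the four-sentence version dense, the two cases it
describes can be sketched in code. This is only an illustration of the revised
wording quoted above, not of any published draft: the ContentModel class, the
effective_model function, and the choice to treat a model with no element
types as "vacuous" are all assumptions made for the sketch, and the rule about
modelElts shared through a common ancestor is left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContentModel:
    """Hypothetical stand-in for a content model in the 1999 draft."""
    kind: str                                       # "eltOnly" or "mixed"
    elts: List[str] = field(default_factory=list)   # element type names


def effective_model(acquired, explicit):
    """Combine acquired models (in refinements-list order) with the explicit one.

    Assumption: a model with no element types counts as vacuous and is skipped.
    """
    non_vacuous = [m for m in [*acquired, explicit] if m.elts]
    if not non_vacuous:
        return ContentModel(explicit.kind, [])

    kinds = {m.kind for m in non_vacuous}
    if kinds == {"eltOnly"}:
        # Sequence of the acquired models in order, then the explicit model.
        return ContentModel("eltOnly", [e for m in non_vacuous for e in m.elts])
    if kinds == {"mixed"}:
        # Union of the element types of all non-vacuous models.
        union = []
        for m in non_vacuous:
            for e in m.elts:
                if e not in union:
                    union.append(e)
        return ContentModel("mixed", union)
    raise ValueError("this sketch does not cover a blend of eltOnly and mixed models")
```

Keeping the explicit model last in the list is what gives the "acquired models
first, explicit model afterwards" ordering of the eltOnly case.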

I've broken down the rest of this into three sections: General Issues,
Terminology, and Specifics by Section.  

General Issues

Examples: some, like those throughout section 3.5 "Archetype Refinement," are
excellent, with good explanations using complete sentences and examples that
use real-world names to make the purpose of the demonstrated constructions 
clearer. However, many if not most examples in the Structures document merely
demonstrate the syntax of a construction without giving any clues as to how and
why it is used. Demonstrating the declaration of a foo with the example 
"<foo name="myFoo">" tells readers nothing about the purpose of a foo that they
couldn't find out from Appendixes A and B (ironically, the brief comments in
Appendix B's DTD sometimes explain the purpose of certain constructs better
than any part of the Structures document itself). Values of "name1" or "name2"
for the name attribute are no better.  

<response id="2">Request for better examples which do not merely demonstrate syntax of a
construct but also explain how and why the construct is used. The examples in
the latest draft being substantially better than those in May draft, we may
regard this as having been discharged.</response>

HTML rendering: references to sections within the Structures document look the
same as links to definitions, making a sentence like "See <ul>Element Type
Declaration</ul> for discussion and examples of the appearance of
<ul>elementTypeDecl</ul> above" (3.4.6) difficult to read. I suggest either
italicizing section titles in these references or adding the phrase "the
section" in front of them--for example, "See the section <ul>Element Type
Declaration</ul> for discussion..." 

<response id="3">Current draft puts section marker and number on section references but not
on definitions; believe this is discharged.</response>

Terminology
Many important new terms are used repeatedly before they are defined. For
example, the revised paragraph above uses the term "vacuous," which hasn't been
defined yet, five times; "archetype" and "NCName" are also used repeatedly
before any clues about their meanings are given. A complete definition at first
use of each new term may cause structural problems, but an abbreviated,
parenthesized definition at first use (section 3.4's definition of SC is a good
model), with a pointer to the full definition, would make the document much
easier to understand for readers whose first introduction to the proposal is a
cover-to-cover reading of this document. Perhaps an introductory overview like
the SOX Note's "Structure of a SOX Document" would be a good place to first
bring up these concepts and terms. It would make the remainder of the spec much
easier to read. If a new section isn't added, at least more entries could be
added to section 2.4.  

<response id="4">Editorial: request to define terms before use.</response>

The Structures document's frequent misuse of parts of speech (for example, using
verbs like "include," "specialize," and "import" as both nouns and adjectives,
"specialize" as a noun, and adjectives like "fixed" as a noun) makes it very
difficult to read. I can only imagine what it's like for someone not speaking
English as a first language. To say "this is a technical usage" is no excuse
unless there is a good precedent (Knuth, dragon book, etc.) for a given
term. Otherwise, that's like saying "we're computer people, it's OK for us,
deal with it." See more about this on "include" below.

<response id="5">Editorial: request not to misuse parts of speech.</response>

When a non-noun (for example, "specialize") is used as a noun because it's a
token (that is, the lhs of some production in the document), references to it
would be easier to read if described as "a specialize token" (or constraint, or
whatever). This is done nicely in the comment before Appendix A's element type
declaration for archetype: "It may include a refines element that specifies..."
Other places in the Structures document would have put this "It may include a
refines that specifies..."  Obviously the former is clearer. 

<response id="6">Editorial: request to handle use of certain terms as tokens to be explicit
as in 'refines element' or 'import token'. Believe that this is now largely the
case in the current draft, although there remain some instances,
e.g. definition at top of 6.2.3.7.</response> 

Vacuous: this is a pejorative term, and therefore more colorful than any
alternatives that I'm sure were considered, but do you need this much color?
"Vacant" would be more appropriate.  

<response id="7">Editorial: largely moot -- only one instance of 'vacuous' exists.</response>

Refine: The standard English use of the term gets twisted too far. To "refine"
something is to change it, not to create a changed copy. I assume that
"inherit" was considered and rejected, although I don't understand why,
especially considering the associated vocabulary brought along with it, like
"ancestor" and "daughter."  

<response id="8">Use 'inherit' instead of 'refine'.</response>

Daughter: I assume that this is used instead of "children" because of the
latter's use in referring to contained elements. 
"Son" would be considered sexist, but so is "daughter." To me, "daughter"
implies that there is a binary distinction between two types of
descendants. (What if red-black trees had been called "son-daughter" trees?)
Why not just call these "descendants"?

<response id="9">Object to the term 'daughter'.</response>

Export: as with "refine," the use of the term has something in common with the
standard English usage but also something significantly different from it,
which will confuse people. To export something is to actively send it
somewhere, whether you're sending bourbon from Kentucky to Japan or a
comma-delimited file from Excel to a named directory. To merely make something
available for import does not export it. (On the other hand, "import" as used
in the schema spec does make sense.) 

<response id="10">Object to the term 'export'.</response>

Nearly well-formed: the term "nearly" adds vagueness that doesn't help any
specification. "Nearly well-formed" says that a document falls short of
complete well-formedness and that we're not sure where it falls short. For a
document whose incompleteness in meeting a certain ideal can be specifically
identified (as "nearly well-formed" is used in the document) a term like
"adequately well-formed" would be more appropriate.  

<response id="11">Object to the term 'nearly well-formed' on grounds that we don't know how a
nearly well-formed document falls short of well-formedness.  Believe this is
moot -- nearly well-formed is defined precisely.</response>

include (as a noun): This is well-understood by programmers, but I don't
consider it a technical term. Like the term "dialog" to refer to a dialog box,
it's programmer slang. The Merriam-Webster dictionary has no listing for
"include" as a noun, but it does define "inclusion" as "something that is
included." For a more computer science way to say it, "included external
resource" would also make sense. The last paragraph of 4.7, in addition to
using "include" as a noun, also uses "included schema," which is much better. 

<response id="12">Objects to use of 'include' as a noun. Believe in current draft this
actually comes down to the same as issue DuCharme#6</response>

Plural of "schema": the document uses the term "schemata" several times and
"schemas" many more times. Either it should spell out a specific reason for
using one over the other in certain contexts or it should pick one, identify it
in the glossary definition of "schema" (just as a dictionary names a plural in
a definition) and use it consistently. (My vote: "Schemas." As Orwell put it,
"Bad writers, and especially scientific, political and sociological writers,
are nearly always haunted by the notion that Latin or Greek words are grander
than Saxon ones." http://www.bnl.com/shorts/stories/patel.html)

Global and top-level: both are used several times in the document, but I
couldn't find a definition of either in the document. I'm guessing that
"top-level" means a non-nested elementTypeDecl. Whether I'm right or wrong, its
meaning should be made more explicit. 

<response id="13">Moot</response>

Specifics by Section
1) 2.1. definition of "Schema"
"...the information set of XML documents" is pretty broad; doesn't it mean "the
information set of a particular class/collection/set/type of documents"? The
Structures document never mentions the concept of a "document class" or
"document type." Does it ever describe a way to refer to a collection of
documents conforming to a particular schema? Or do we just assume the use of
the XML term document type?  

<response id="14">Moot</response>

2) 2.4 Purpose of "Archetype Definition," "Content Type," and "Element Content
Model"
"Elements" in each of these is vague, much like "documents" is in 2.1 as
described above. Each use of the term looks like it refers to *all* the
elements in a document instance; don't they mean "a specified class/set/type of
elements," especially considering that each defined term is given in the
singular?  

<response id="15">Greater precision desired in 2.4 in that it should confine to particular
classes.</response> 

3) 3.1 caption under second example
Does "new component" refer to a new component of a schema? A new class of
components for a document? Who is the "we" doing the declaring? Isn't the
schema doing the declaring? The distinction between creating, declaring, and
specifying ("the specification for that component") in this sentence is
confusing. Does the sentence mean "By declaring a new component, a schema
associates that component's name with the specification for that component"?  

<response id="16">Editorial: caption in second example in 3.1 lacks clarity and precision.</response>

4) 3.3, "Constraint on Schemas: One Reference Only"
"It is an error for both these attributes to appear on the same element in a
schema." Then perhaps they shouldn't be attributes. If they were child elements
of the import element type, a (schemaAbbrev|schemaName) equivalent in the
content model would put this constraint in the schema language's concrete
syntax, where its enforcement is more easily automated than that of a
constraint that is only described in prose documentation.  

<response id="17">Moot</response>
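
To make the automation point in comment 4) concrete: a constraint that lives
only in prose has to be checked by hand-written code along the following
lines, whereas a (schemaAbbrev|schemaName) content model would let any
validating parser reject the bad case. This is a minimal sketch under stated
assumptions: the attribute names schemaAbbrev and schemaName are inferred from
the comment, the function name is made up for illustration, and the sample
markup is un-namespaced, which the May 1999 draft may not have been.

```python
import xml.etree.ElementTree as ET


def imports_with_both_references(schema_xml):
    """Return the import elements that carry both reference attributes.

    Assumption: the two mutually exclusive attributes are named
    schemaAbbrev and schemaName, and elements are not namespace-qualified.
    """
    root = ET.fromstring(schema_xml)
    return [
        el for el in root.iter("import")
        if "schemaAbbrev" in el.attrib and "schemaName" in el.attrib
    ]


# Hypothetical sample: the first import violates "One Reference Only".
sample = (
    "<schema>"
    '<import schemaAbbrev="po" schemaName="http://example.org/po"/>'
    '<import schemaAbbrev="addr"/>'
    "</schema>"
)
offenders = imports_with_both_references(sample)
```

Every schema processor would need to carry an equivalent of this check, which
is the maintenance cost the comment is pointing at.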

5) 3.3, last paragraph
The use of the term "appropriate" (three times) is confusing. 

<response id="18">Moot</response>

6) 3.3, last paragraph
"...may also obtain." May also what? 

<response id="19">Moot</response>

7) 3.4.2 first paragraph
"...pertinent to elements in instance documents." See 2) above. 

<response id="20">Duplicate//moot</response>

8) 3.4.4 Attribute Group Definitions
If I understand archetypes correctly, they can (among other things) group a
collection of attribute definitions into a named, reusable unit, so I don't see
what named attribute groups add to the schema language. What am I missing?  

<response id="21">Attribute group definitions unnecessary. Duplicate of existing (closed)
issue #34.</response>

9) 3.4.9 first sentence
"An element type declares the..." should read "An element type declaration
declares the..." An element type doesn't declare anything; it gets declared. 

<response id="22">Moot</response>

10) 3.5 "substitutability" definition
"One archetype is substitutable for another if any schema-valid instance of the
former is necessarily..."  The term "document instance" throughout the
Structures document makes sense, as does the concept of an element
instance. This line seems to be referring to an archetype instance, which I
don't understand. Or does it mean "schema-valid element instance conforming to
the former is necessarily..."?  

<response id="23">Moot</response>

11) 3.5 "NOTE" describing regularPolygon example
So the example's regularPolygon element is valid with respect to the polygon
archetype, even though it has a "side" child element not mentioned by the
polygon archetype declaration, because polygon has a "model" value of
"refinable," right?  

<response id="24">Moot</response>

12) 3.6.1
"flavor can now be used in an entity reference in instances of the containing
schema" as well as in document instances that conform to the containing schema,
right?  

<response id="25">Moot</response>

13) 4.1 title
If "Instance Document Constructs" are different from "Instance Documents" then
they should be defined. If not, the title should just say "Instance Documents."

<response id="26">Moot</response>

14) 4.2 second example
The empty "export" element has an improperly closed XML comment. 

<response id="27">Moot</response>

15) 4.3 NOTE
"Head" is never defined. Does this mean right after the <schema> start-tag?
Does it mean the very beginning of the document, or right after the XML
declaration if there is one? It needs to be clarified.  

<response id="28">Moot</response>

16) 4.5 first paragraph
"Composed" is emphasized, but never defined. I assume it has no connection to
compositor (production [36]).  

<response id="29">Moot</response>

17) 4.6 second example
I believe that second <import start-tag should be an end-tag. 

<response id="30">Moot</response>

18) 6.1 paragraph beginning "The provision within..."
"The effective element item of an element item (call this OEI)..." Why? What
does the "O" stand for?

Overall, there's a lot of great stuff in the draft. I look forward to the
software that can work with these schema; kudos to Rick Jelliffe for jumping
right in there! 

<response id="31">Definition of effective element item: object to abbreviation OEI</response>


Thanks again

ht, with a _lot_ of help from Mary Holstege

[1] http://lists.w3.org/Archives/Public/www-xml-schema-comments/1999AprJun/0038.html 
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2001, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/
Received on Thursday, 27 January 2000 06:42:46 UTC