"Structures" comments from DuCharme, Robert on 1999-05-13 (www-xml-schema-comments@w3.org from April to June 1999)

From: DuCharme, Robert <DuCharmR@moodys.com>
Date: Thu, 13 May 1999 11:15:51 -0400
To: "'www-xml-schema-comments@w3.org'" <www-xml-schema-comments@w3.org>
Message-ID: <84285D7CF8E9D2119B1100805FD40F9F255267@MDYNYCMSX1>
Comments on the 6-May-1999 "XML Schema Part 1" Working Draft
============================================================

I like a lot of it, but I've limited this message to comments on general
concepts, practices, and choice of terminology that I feel need
revision. In my marked-up hard copy, I have noted typos and suggested
revisions for clarity that would fall into the category of "copyediting"
and "basic Strunk and White stuff" (turning passive sentences into
active, etc.) if the WG is interested in seeing them at this stage. As
an example: the single sentence--and what a sentence--from section 3.5,

"No element type is referenced by more than one of the explicit and
acquired content models (unless two or more acquired models share
modelElts acquired from a common ancestor, in which case such modelElts
shall be ignored in all but the first for the purpose of constructing
the effective model), in which case if the non-vacuous explicit and
acquired models are all eltOnly the effective model is a sequence of all
the non-vacuous acquired models, in the order in which they are
specified in the refinements list, followed by the explicit model (if it
is non-vacuous), or else if the non-vacuous explicit and acquired models
are all mixed, the effective model is a mixed whose elementTypeRefs and
elementTypeDecls are the union of the elementTypeRefs and
elementTypeDecls of all the non-vacuous explicit and acquired models."

could be revised into the following four sentences over two paragraphs:

"No element type is referenced by more than one of the explicit and
acquired content models (unless two or more acquired models share
modelElts acquired from a common ancestor, in which case the processor
shall ignore all but the first modelElt when constructing the effective
model).

There are two ways this can happen: either the non-vacuous explicit and
acquired models are all eltOnly or they are mixed. If they are all
eltOnly, the effective model is a sequence of all the non-vacuous
acquired models, in the order in which the refinements list specifies
them, followed by the explicit model if it is non-vacuous. If the
non-vacuous explicit and acquired models are all mixed, the effective
model is a mixed content model whose elementTypeRefs and
elementTypeDecls are the union of the elementTypeRefs and
elementTypeDecls of all the non-vacuous explicit and acquired models."

My revision may betray a misunderstanding of the original's meaning, but
the general idea of breaking down overlong sentences (134 words!) into
multiple sentences or even bulleted lists would make the Structures
document much easier to understand.
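
As a rough illustration of the two cases in the revised wording, the
construction can be sketched in Python. This is only a sketch under
stated assumptions, not the Working Draft's algorithm: the Model class
and the effective_model function are hypothetical names, a model with no
element names is treated as "vacuous" (the draft's own definition may
differ), and the parenthetical rule about modelElts shared through a
common ancestor is left out.

```python
from dataclasses import dataclass


@dataclass
class Model:
    kind: str       # "eltOnly" or "mixed"
    elements: list  # element type names, in order; empty is assumed to mean vacuous


def effective_model(explicit, acquired):
    """Combine the explicit model with acquired models (in refinements-list order)."""
    # Keep only non-vacuous models: acquired ones first, explicit one last.
    live = [m for m in acquired if m.elements]
    if explicit.elements:
        live.append(explicit)
    kinds = {m.kind for m in live}
    if kinds == {"eltOnly"}:
        # Sequence of the acquired models, followed by the explicit model.
        return ("sequence", [m.elements for m in live])
    if kinds == {"mixed"}:
        # Mixed model over the union of all referenced element types.
        union = []
        for m in live:
            for name in m.elements:
                if name not in union:
                    union.append(name)
        return ("mixed", union)
    # The quoted passage does not say what happens in any other case.
    raise ValueError("case not covered by the quoted passage")


a = Model("eltOnly", ["title"])
b = Model("eltOnly", ["author"])
explicit = Model("eltOnly", ["price"])
print(effective_model(explicit, [a, b]))
print(effective_model(Model("mixed", ["code"]), [Model("mixed", ["em"])]))
```

Writing the rule out this way also shows what the prose leaves open, such
as a combination of eltOnly and mixed models.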

I've broken down the rest of this into three sections: General Issues,
Terminology, and Specifics by Section.

*General Issues 

Examples: some, like those throughout section 3.5 "Archetype
Refinement," are excellent, with good explanations using complete
sentences and examples that use real-world names to make the purpose of
the demonstrated constructions clearer. However, many if not most
examples in the Structures document merely demonstrate the syntax of a
construction without giving any clues as to how and why it is used.
Demonstrating the declaration of a foo with the example "<foo
name="myFoo">" tells readers nothing about the purpose of a foo that they
couldn't find out from Appendixes A and B (ironically, the brief
comments in Appendix B's DTD sometimes explain the purpose of certain
constructs better than any part of the Structures document itself).
Values of "name1" or "name2" for the name attribute are no better.

HTML rendering: references to sections within the Structures document
look the same as links to definitions, making a sentence like "See
<ul>Element Type Declaration</ul> for discussion and examples of
the appearance of <ul>elementTypeDecl</ul> above" (3.4.6)
difficult to read. I suggest either italicizing section titles in these
references or adding the phrase "the section" in front of them--for
example, "See the section <ul>Element Type Declaration</ul> for
discussion..."

*Terminology 

Many important new terms are used repeatedly before they are defined.
For example, the revised paragraph above uses the term "vacuous," which
hasn't been defined yet, five times; "archetype" and "NCName" are also
used repeatedly before any clues about their meanings are given. A
complete definition at first use of each new term may cause structural
problems, but an abbreviated, parenthesized definition at first use
(section 3.4's definition of SC is a good model), with a pointer to the
full definition would make the document much easier to understand for
readers whose first introduction to the proposal is a cover-to-cover
reading of this document. Perhaps an introductory overview like the SOX
Note's "Structure of a SOX Document" would be a good place to first
bring up these concepts and terms. It would make the remainder of the
spec much easier to read. If a new section isn't added, at least more
entries could be added to section 2.4.

The Structures document's frequent misuse of parts of speech (for
example, using verbs like "include," "specialize," and "import" as both
nouns and adjectives, "specialize" as a noun, and adjectives like "fixed"
as a noun) makes it very difficult to read. I can only imagine what it's
like for someone not speaking English as a first language. To say "this
is a technical usage" is no excuse unless there is a good precedent
(Knuth, dragon book, etc.) for a given term. Otherwise, that's like
saying "we're computer people, it's OK for us, deal with it." See more
about this on "include" below.

When a non-noun (for example, "specialize") is used as a noun because
it's a token (that is, the lhs of some production in the document),
references to it would be easier to read if described as "a specialize
token" (or constraint, or whatever). This is done nicely in the comment
before Appendix A's element type declaration for archetype: "It may
include a refines element that specifies..." Other places in the
Structures document would have put this "It may include a refines that
specifies..." Obviously the former is clearer.

Vacuous: this is a pejorative term, and therefore more colorful than any
alternatives that I'm sure were considered, but do you need this much
color?  "Vacant" would be more appropriate.

Refine: The standard English use of the term gets twisted too far. To
"refine" something is to change it, not to create a changed copy. I
assume that "inherit" was considered and rejected, although I don't
understand why, especially considering the associated vocabulary brought
along with it, like "ancestor" and "daughter."

Daughter: I assume that this is used instead of "children" because of
the latter's use in referring to contained elements. "Son" would be
considered sexist, but so is "daughter." To me, "daughter" implies that
there is a binary distinction between two types of descendants. (What if
red-black trees had been called "son-daughter" trees?) Why not just call
these "descendants"?

Export: as with "refine," the use of the term has something in common
with the standard English usage but also something significantly
different from it, which will confuse people. To export something is to
actively send it somewhere, whether you're sending bourbon from Kentucky
to Japan or a comma-delimited file from Excel to a named directory. To
merely make something available for import does not export it. (On the
other hand, "import" as used in the schema spec does make sense.)

Nearly well-formed: the term "nearly" adds vagueness that doesn't help
any specification. "Nearly well-formed" says that a document falls short
of complete well-formedness and that we're not sure where it falls
short. For a document whose incompleteness in meeting a certain ideal
can be specifically identified (as "nearly well-formed" is used in the
document) a term like "adequately well-formed" would be more
appropriate.

include (as a noun): This is well-understood by programmers, but I don't
consider it a technical term. Like the term "dialog" to refer to a
dialog box, it's programmer slang. The Merriam-Webster dictionary has no
listing for "include" as a noun, but it does define "inclusion" as
"something that is included." For a more computer science way to say it,
"included external resource" would also make sense. The last paragraph
of 4.7, in addition to using "include" as a noun, also uses "included
schema," which is much better.

Plural of "schema": the document uses the term "schemata" several times
and "schemas" many more times. Either it should spell out a specific
reason for using one over the other in certain contexts or it should
pick one, identify it in the glossary definition of "schema" (just as a
dictionary names a plural in a definition) and use it consistently. (My
vote: "Schemas." As Orwell put it, "Bad writers, and especially
scientific, political and sociological writers, are nearly always
haunted by the notion that Latin or Greek words are grander than Saxon
ones." http://www.bnl.com/shorts/stories/patel.html)

global and top-level: both are used several times in the document, but I
couldn't find a definition of either in the document. I'm guessing that
"top-level" means a non-nested elementTypeDecl. Whether I'm right or
wrong, its meaning should be made more explicit.

*Specifics by Section

1) 2.1. definition of "Schema"

"...the information set of XML documents" is pretty broad; doesn't it
mean "the information set of a particular class/collection/set/type of
documents"?  The Structures document never mentions the concept of a
"document class" or "document type." Does it ever describe a way to
refer to a collection of documents conforming to a particular schema?
Or do we just assume the use of the XML term document type?

2) 2.4 Purpose of "Archetype Definition," "Content Type," and "Element
Content Model"

"Elements" in each of these is vague much like "documents" is in 2.1 as
described above. Each use of the term looks like it refers to *all* the
elements in a document instance; don't they mean "a specified
class/set/type of elements," especially considering that each defined
term is given in the singular?

3) 3.1 caption under second example

Does "new component" refer to a new component of a schema? A new class
of components for a document? Who is the "we" doing the declaring? Isn't
the schema doing the declaring? The distinction between creating,
declaring, and specifying ("the specification for that component") in
this sentence is confusing. Does the sentence mean "By declaring a new
component, a schema associates that component's name with the
specification for that component"?

4) 3.3, "Constraint on Schemas: One Reference Only"

"It is an error for both these attributes to appear on the same element
in a schema."  Then perhaps they shouldn't be attributes. If they were
child elements of the import element type, a (schemaAbbrev|schemaName)
equivalent in the content model would put this constraint in the schema
language's concrete syntax, where its enforcement is more easily
automated than that of a constraint that is only described in prose
documentation.
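
To make the contrast concrete, here is a minimal Python sketch of the
kind of check each processor has to hand-code while the constraint lives
only in prose. The names schemaAbbrev and schemaName come from the
suggestion above; the function name and the sample markup are
hypothetical and are not taken from the Working Draft.

```python
import xml.etree.ElementTree as ET

# Attribute names from the comment above; everything else is illustrative.
EXCLUSIVE_ATTRS = ("schemaAbbrev", "schemaName")


def violates_one_reference_only(element):
    """Return True if the element carries both mutually exclusive attributes."""
    return all(name in element.attrib for name in EXCLUSIVE_ATTRS)


# Hypothetical import elements, not copied from the draft.
ok = ET.fromstring('<import schemaName="http://example.org/s.xsd"/>')
bad = ET.fromstring(
    '<import schemaAbbrev="s" schemaName="http://example.org/s.xsd"/>'
)

print(violates_one_reference_only(ok))   # False
print(violates_one_reference_only(bad))  # True
```

With a content model such as (schemaAbbrev|schemaName), a validating
parser would reject the second form with no custom code of this kind.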

5) 3.3, last paragraph

The use of the term "appropriate" (three times) is confusing.

6) 3.3, last paragraph

"...may also obtain." May also what?

7) 3.4.2 first paragraph

"...pertinent to elements in instance documents." See 2) above.

8) 3.4.4 Attribute Group Definitions

If I understand archetypes correctly, they can (among other things)
group a collection of attribute definitions into a named, reusable unit,
so I don't see what named attribute groups add to the schema language.
What am I missing?

9) 3.4.9 first sentence

"An element type declares the..." should read "An element type
declaration declares the..." An element type doesn't declare anything;
it gets declared.

10) 3.5 "substitutability" definition

"One archetype is substitutable for another if any schema-valid instance
of the former is necessarily..."

The term "document instance" throughout the Structures document makes
sense, as does the concept of an element instance. This line seems to be
referring to an archetype instance, which I don't understand. Or does it
mean "schema-valid element instance conforming to the former is
necessarily..."?

11) 3.5 "NOTE" describing regularPolygon example

So the example's regularPolygon element is valid with respect to the
polygon archetype, even though it has a "side" child element not
mentioned by the polygon archetype declaration, because polygon has a
"model" value of "refinable," right?

12) 3.6.1

"flavor can now be used in an entity reference in instances of the
containing schema" as well as in document instances that conform to the
containing schema, right?

13) 4.1 title

If "Instance Document Constructs" are different from "Instance
Documents" then they should be defined. If not, the title should just
say "Instance Documents."

14) 4.2 second example

The empty "export" element has an improperly closed XML comment.

15) 4.3 NOTE

"Head" is never defined. Does this mean right after the <schema>
start-tag? Does it mean the very beginning of the document, or right
after the XML declaration if there is one? It needs to be clarified.

16) 4.5 first paragraph

"Composed" is emphasized, but never defined. I assume it has no
connection to compositor (production [36]).

17) 4.6 second example

I believe that second <import start-tag should be an end-tag.

18) 6.1 paragraph beginning "The provision within..."

"The effective element item of an element item (call this OEI)..." Why?
What does the "O" stand for?


Overall, there's a lot of great stuff in the draft. I look forward to
the software that can work with these schemas; kudos to Rick Jelliffe for
jumping right in there!

Bob DuCharme          www.snee.com/bob           <bob@  
snee.com>  "The elements be kind to thee, and make thy
spirits all of comfort!" Anthony and Cleopatra, III ii
Received on Thursday, 13 May 1999 11:08:25 UTC