Comments on XML Schemas: Structures (long)

OVERALL COMMENTS:
================
This is a very nice first draft. It is well-organized, mostly readable (a 
bit thick in places), and fairly thorough. I'm really happy to see that a 
lot of work went into usability features such as archetypes, named content 
models, attribute groups, etc.

My apologies that these comments are so late.

MAJOR COMMENTS:
================
Section 3.5 -- Archetype Refinement
Archetype refinement is very scary.  I have no real technical grounds for 
complaint -- it just feels overly complex and of limited use compared 
to most other features. While I understand the motivation, I would rather 
see it postponed to a later version.

Section 3.6 -- Entities and Notations
I strongly suggest that you split entity declarations into a separate 
language, just as data types are a separate language. While I concede the 
need to declare entities in instance syntax (for example, in the Fragments 
spec), they shouldn't be mixed together with the logical declarations. The 
primary reason for this is that entities can be defined on a per-document 
basis, while the logical declarations are defined on a per-document-class 
basis. Mixing the two makes it difficult or impossible to define schemas that 
are useful for an entire class of documents.

MINOR COMMENTS:
==============
Section 2.3 -- On 'types'
What is gained by making the distinction between definitions and 
declarations? In particular, this is reflected in the element tag names and 
is unlikely to be understood by the unwashed masses (like me) who are 
writing schemas.

Section 3.1 -- The Schema
a) What is the relationship between schemaIdentity and schemaName and why 
are they separate? My first guess was that schemaName gave the location of 
a schema document (for example, for use in an import statement) and that 
schemaIdentity gave the (single, unique, unchanging) ID of the schema. 
However, this theory doesn't seem to be true, as schemaRef refers to 
schemaName, not schemaIdentity.  I suspect that schemaIdentity should be 
replaced by schemaName (or vice versa).

b) Why are schemaIdentity and version separate? In my mind, a different 
version of a schema should have a different identity -- I certainly don't 
want to try to validate a document using version A of a schema against 
version B of that schema.

c) The Unique Definition constraint says that the same NCName cannot be 
used for two definitions or declarations of the same type. However, 
unparsed entities and parsed general entities occupy a single symbol space 
in the schema language. Is this correct?  I thought these had separate 
symbol spaces, but I can't find anything in the XML spec that actually 
states this.

Section 3.2 -- The Document and its Root
The ability to declare a root element type is useful. For example, an 
application that reads a schema document might reasonably expect it to 
start with a <schema> element and consider that document to be invalid 
otherwise. I think this ability should be added as an option.

Section 3.3 -- References to Schema Constructs
a) What is gained by the ability to reference a schema by both its 
abbreviation and its name? Except for showing off typing skills, are there 
any good reasons to refer to a schema by its name? If not, remove this 
ability.

b) Get rid of the schemaAbbrev attribute and use prefixed names as is done 
with namespaces. For example, <elementTypeRef name="HTML:BLOCKQUOTE"/>. 
This is much easier to read and more intuitive for people accustomed to 
namespaces. The difference in processing cost is not significant.
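
To illustrate (b), here is a minimal Python sketch of how a processor might 
split a prefixed name and look up the schema it refers to. The function 
name, the prefix table, and the example.org URI are hypothetical and are 
not taken from the draft.

```python
def split_prefixed_name(name, prefixes):
    """Split 'HTML:BLOCKQUOTE' into (schema_name, local_name).

    `prefixes` maps a declared prefix to the schema it stands for
    (a hypothetical structure, analogous to namespace declarations).
    Unprefixed names are returned with a schema of None.
    """
    prefix, sep, local = name.partition(":")
    if not sep:
        return None, name
    if prefix not in prefixes:
        raise KeyError(f"undeclared schema prefix: {prefix!r}")
    return prefixes[prefix], local


# Hypothetical prefix declaration for the example in the text.
prefixes = {"HTML": "http://example.org/schemas/html"}
resolved = split_prefixed_name("HTML:BLOCKQUOTE", prefixes)
```

The extra work over a separate schemaAbbrev attribute is a single string 
split and a table lookup.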

Section 3.4.1 -- Datatype Definition
a) Specialization of data types at point of use is too flexible and too 
likely to cause confusion. Instead, require people to define and use new 
data types. However, declaring default values at point of use should be 
retained, as this applies to the use of the type and not the type itself.

b) Why is fixed part of the data type qualification? This has nothing to do 
with data types and should be part of the attribute or element type 
constraint.

c) What issues are there about aggregate data types that need to be 
resolved?  An aggregate data type is simply an element content model.

Section 3.4.2 -- Archetype Definition
What is a default element value and when is it applied? When the element is 
empty? When it is missing? If the latter, and more than one instance of the 
element type is legal, how many are created? My gut feeling is to delete 
this.

Section 3.4.3 -- Attribute Declaration
Why should there be a default attribute data type?  There isn't one now.

Section 3.4.6 -- Mixed Content
a) The use of a <mixed> element with no children to indicate PCDATA-only 
content is not intuitive to most users and is likely to lead to confusion. 
Either:
   i) Add a <pcdata> element, or
   ii) Require one or more elementTypeRefs under <mixed>. PCDATA-only 
content is stated with <datatypeRef name="string"/>.
I prefer (ii), as it means there is only one way to declare PCDATA-only 
content.

b) What is the purpose of the NOTE? That is, why is it important to be able 
to declare PCDATA-only content without using the above datatypeRef?

Section 3.4.9 -- Element Type Declarations
Locally-scoped element type names break XML 1.0 validity and are probably 
not worth the confusion they will cause -- remember that most document 
authors are not programmers and are not likely to understand scoping. I 
suggest you delete them.

Section 4.1 -- Associating Instance Document Constructs with Corresponding 
Schemas
What is the relationship between schemaIdentity, schemaName, and the 
namespace URI? My guess is that all should be the same, but this is never 
stated.

Section 4.2 -- Exporting Schema Constructs
What is the motivation for export control?  It doesn't seem applicable to 
XML.  Unlike programming languages, where implementation details can be 
hidden from the user, everything in an XML document is visible.  Saying 
that I can see an archetype, attribute, content model, etc. but can't use 
it is just plain silly.  It doesn't really mean I can't use the schema 
object -- it just means that I have to cut and paste instead of using the 
schema language's handy, built-in referencing mechanisms.

Section 4.6 -- Import Restrictions
a) The note asks whether imported definitions are re-exported and states 
that they are not due to difficulties in managing abbreviation 
associations. I don't understand this -- such difficulties must already be 
handled. For example, suppose schema A imports element b from schema B, and 
element b includes element c imported from schema C. If schema B uses the 
same abbreviation for schema C that schema A uses for schema B, the 
processor must resolve this today. This is not difficult, as the processor 
maintains abbreviation lists on a per-schema basis and does its processing 
based on schemaIdentity/Name, not the abbreviation. Thus, imported 
definitions should be re-exported.

b) The note in section 4.7 asks whether import implicitly imports features 
not explicitly imported or only imports such features when needed by 
explicitly imported features. The latter case is the correct one, as it 
makes the schema author's intentions clear and forces them to import 
exactly what they want. (It also saves memory in the processor, which only 
needs to save those schema items needed by the importing schema, rather 
than the entire imported schema. Whether the memory saved is significant 
depends on the size of the imported schema and how much was imported.)
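
The abbreviation handling described in (a) can be sketched roughly as 
follows in Python. The Schema class, the resolve function, and the 
urn:example identifiers are all hypothetical; the only point is that each 
schema keeps its own abbreviation table and that lookups are keyed on the 
schema's identity, so two schemas can reuse the same abbreviation safely.

```python
class Schema:
    def __init__(self, identity):
        self.identity = identity
        self.abbrevs = {}      # abbreviation -> identity of an imported schema
        self.definitions = {}  # local name -> definition


def resolve(schemas, current_identity, abbrev, name):
    """Resolve (abbrev, name) relative to the schema that wrote the reference."""
    current = schemas[current_identity]
    target_identity = current.abbrevs[abbrev]
    return target_identity, schemas[target_identity].definitions[name]


a = Schema("urn:example:A")
b = Schema("urn:example:B")
c = Schema("urn:example:C")
a.abbrevs["x"] = b.identity   # A calls B "x"
b.abbrevs["x"] = c.identity   # B calls C "x" as well
b.definitions["b"] = {"children": [("x", "c")]}
c.definitions["c"] = {"children": []}
schemas = {s.identity: s for s in (a, b, c)}

# A refers to x:b, which lives in B.
owner, b_def = resolve(schemas, a.identity, "x", "b")
# The reference x:c inside b is resolved relative to B, not A.
child_abbrev, child_name = b_def["children"][0]
child_owner, c_def = resolve(schemas, owner, child_abbrev, child_name)
```

Because the nested reference is resolved against the abbreviation table of 
the schema that owns the definition, the clash between the two uses of "x" 
never arises.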

Section 4.7 -- Schema Inclusion
a) Does schema inclusion solve any problems that can't be solved with 
external entities? If not, delete it.

b) If includes are kept, it *must* be an error if identically named items 
are encountered twice. Using the first definition is a bad practice and 
open to abuse. (A reasonable compromise would allow multiple, identical 
definitions.)
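
A minimal Python sketch of the rule proposed in (b), including the 
compromise of tolerating identical duplicates. The names 
DuplicateDefinitionError and merge_included, and the representation of 
definitions as plain dictionaries, are assumptions made for illustration.

```python
class DuplicateDefinitionError(Exception):
    """Raised when an included schema redefines an existing name."""


def merge_included(target, included, allow_identical=True):
    """Merge the definitions of an included schema into `target`.

    A name that already exists is an error, unless `allow_identical`
    is set and the two definitions compare equal.
    """
    for name, definition in included.items():
        if name in target:
            if allow_identical and target[name] == definition:
                continue
            raise DuplicateDefinitionError(name)
        target[name] = definition
    return target


base = {"address": {"content": "street, city"}}
same = {"address": {"content": "street, city"}, "phone": {"content": "string"}}
merged = merge_included(dict(base), same)
```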

Section 4.8 -- Access to Schemata
Basing a schema's location on its name (namespace URI, schemaName, or 
schemaIdentity) is a bad idea. A generic schema processor, such as a 
generic validation module or a schema-driven editor, has only one realistic 
choice when it comes to locating the schema document, and that is to hope 
that the URI is a URL and try to resolve it. (I don't believe that schema 
name servers are going to appear any time soon, if ever.)  Unfortunately, 
this forms a one-to-one relationship between schema names and locations, 
which precludes multiple copies of the schema. It also means that, in most 
cases, the processor must be connected to the Web.

One possible solution is to separate location information from the schema's 
name. This is needed in import and include statements within the schema and 
also in whatever mechanism (not namespace declarations) is used to 
associate schemas with instance documents. The location information can 
still be a URI, and the mechanism by which the URI is resolved to an actual 
location can still be processor-specific, but this does allow generic 
processors to simply resolve the URI as a URL or fail.
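
As a rough Python sketch of this separation: the schema name is used only 
as an identifier, and the location comes either from a processor-specific 
catalog of local copies or from the location information given with the 
import. The function locate_schema, its parameters, and the example URIs 
are hypothetical.

```python
def locate_schema(schema_name, location_hint=None, local_catalog=None):
    """Return a URI to fetch the schema from, or raise LookupError.

    `schema_name` identifies the schema and is never dereferenced.
    `local_catalog` (processor-specific) maps names to local copies.
    `location_hint` is the location given alongside the name.
    """
    local_catalog = local_catalog or {}
    if schema_name in local_catalog:
        return local_catalog[schema_name]
    if location_hint:
        return location_hint
    raise LookupError(f"no location known for schema {schema_name!r}")


name = "urn:example:purchase-order"
hint = "http://example.org/schemas/po-schema.xml"
from_hint = locate_schema(name, location_hint=hint)
from_copy = locate_schema(name, hint, {name: "file:///schemas/po-schema.xml"})
```

A generic processor with no catalog simply uses the hint, while a 
processor that holds a local copy can work without a Web connection.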

Section 5 -- Documenting schemas
When designing documentation elements, please consider the following:
a) Display names, which can be different from item names. For example, I 
want an element type of SalesOrder, but I want this to result in a form 
name of Sales Order.
b) Support for multiple languages (English, French, etc.)

Section 6.1 -- Schema Validity
a) The sentence after the definition of schema-ready for documents states 
that a document is schema-ready even if it has no namespace declarations. 
How? The definition of schema-ready for documents states that the document 
is schema-ready if all of its elements are schema-ready, and the definition 
of schema-ready for elements states that an element is schema-ready if any 
of its namespace declarations resolves to a schema. This doesn't seem to 
cover the case where none of the namespace declarations resolves to a 
schema or there are no relevant namespace declarations. I think these cases 
need to be added explicitly to the definition of schema-ready for elements.

b) The definition of schema-valid allows partial validation (in the XML 1.0 
sense) of documents. While this is undoubtedly useful, many (most?) 
applications will want full validity (in the XML 1.0 sense). I think you 
need a definition such as totally-schema-valid, which is the same as 
schema-valid except that all elements must be schema-governed, and a way 
for applications to request this.

c) In the description of how DTDs and schemas interact, please explain what 
happens when the DTD and schema conflict. This is (perhaps unintentionally) 
covered in part by the first bullet, which allows for this case.  I think 
it is OK to simply say that this is the document author's problem.
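
The distinction requested in (b) can be sketched in Python as below. This 
assumes that schema-valid means every schema-governed element is valid and 
that ungoverned elements are skipped; the Element class and both function 
names are hypothetical, and totally-schema-valid is the term proposed in 
(b) rather than one defined in the draft.

```python
from dataclasses import dataclass


@dataclass
class Element:
    name: str
    governed: bool  # True if some schema governs this element
    valid: bool     # outcome of checking the element against that schema


def is_schema_valid(elements):
    """Partial validation: only schema-governed elements are checked."""
    return all(e.valid for e in elements if e.governed)


def is_totally_schema_valid(elements):
    """Full validation: every element must be governed and valid."""
    return all(e.governed and e.valid for e in elements)


doc = [Element("order", True, True), Element("note", False, False)]
partial = is_schema_valid(doc)
total = is_totally_schema_valid(doc)
```

Here the ungoverned "note" element does not affect the partial result but 
does make the document fail the stricter check.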

Section 6.2 -- Responsibilities of Schema-aware processors
Why is item 6 (exposing the combined information set) required for 
conformance?

TYPOS, ERRORS, NITPICKING, ETC.
==============================
Section 1.3 -- Relationship to Other Work
Consider mentioning the Fragment spec, which wants to use schema syntax 
(including entity definitions) to represent DTDs in line.

Section 2.4 -- Schemas and their component parts
The table states that archetype refinements are named.  I assume this is an 
error, as I see no way to name a refinement (as distinct from an 
archetype).  If there is a way to name a refinement, it should share the 
same symbol space with archetypes.

Section 3.3 -- References to Schema Constructs
The Consistent Import constraint states, "A schemaAbbrev or schemaName in a 
schemaRef must be declared in an Schema Import of the current schema, ..." 
I think it would be clearer to say "... in the current schema..."  "Of" 
implies that the current schema is being imported somewhere else.

Section 3.4.1 -- Datatype Definition
The second paragraph states that "datatype[s constrain]...the character 
data contents of elements". This should be more specific and state that 
they constrain the character data content of elements that can only contain 
character data. Clearly, data types do not constrain character data when 
both character data and element content are present.

Section 3.4.2 -- Archetype Definition
Archetypes are really nice, but the name is obscure. How about "base type", 
"base element type", or "abstract type" instead?

Section 3.4.3 -- Attribute Declaration
In production [24], required should be followed by a "?". If it is 
understood to be a choice of required/not required, it needs a production 
of its own.

Section 3.4.4 -- Attribute Group Definition
Productions [26], [27], and [29] do not match the DTD in appendix B:
a) [26] should be: attrGroupSpec ::= (attrDecl | attrGroupRef)+ 
exportControl
b) [27] should be: attrGroupRef ::= attrGroupName
c) [29] should be deleted.  (What is its purpose, anyway?)

Section 3.4.7 -- Element-Only Content
The example states that the default of maxOccur is 1. In fact, maxOccur has 
no default.

Section 3.4.9 -- Element Type Declarations
If locally-scoped element type names are retained, two changes are needed:

a) At the start of the fifth paragraph, change "An elementTypeDecl may also 
appear within a modelElt..." to "... within a modelElt or mixed..."

b) Clarify section 2.5 with respect to the symbol spaces of attributes and 
element types; at the very least, simply add cross-references to the 
relevant explanations (sections 3.4.3 and 3.4.9).

Section 3.6 -- Entities and Notations
Why are notations included with entities?  Entities are a physical 
construct and notations are a logical construct.

Section 3.6.1 & .2 -- Internal/External Parsed Entity Declaration
a) Change "internal/external parsed entity" to "internal/external parsed 
general entity" to make it clear you are defining general entities and not 
parameter entities.

b) In the example in 3.6.1, change "... in instances of the containing 
schema..." to "in documents that use the schema..."

Section 4.1 -- Associating Instance Document Constructs with Corresponding 
Schemas
In the last sentence of the example, "content model" should be "archetype".

Throughout Entire Specification
a) There are numerous grammatical errors in the use of "a" and "an" -- 
admittedly minor, but annoying.

b) Any chance that "datatype" could be made two words again?

Thanks,

-- Ron Bourret

Received on Tuesday, 8 June 1999 06:28:23 UTC