Re: Comments on XML Schemas: Structures (long)

Please note that I am not responding in any official capacity.

At 12:25 PM 6/8/99 +0200, Ronald Bourret wrote:
>OVERALL COMMENTS:
>================
>This is a very nice first draft. It is well-organized, mostly readable (a 
>bit thick in places), and fairly thorough. I'm really happy to see that a 
>lot of work went into usability features such as archetypes, named content 
>models, attribute groups, etc.

Thanks.
>
>My apologies that these comments are so late.

No problem. I am glad to have your input.
>
>MAJOR COMMENTS:
>================
>Section 3.5 -- Archetype Refinement
>Archetype refinement is very scary.  I have no real technical grounds for 
>complaint -- it just feels overly complex and of limited use with respect 
>to most other features. While I understand the motivation, I would rather 
>see it postponed to a later version.

Thanks for your input. Refinement is a subject that will require 
further elaboration before it is ready for prime time, in my opinion.
I agree with the sentiment to postpone this feature.
>
>Section 3.6 -- Entities and Notations
>I strongly suggest that you split entity declarations into a separate 
>language, just as data types are a separate language. 

Actually, datatypes are written in a separate spec, but they are
an integral part of the schema language.

>While I concede the 
>need to declare entities in instance syntax (for example, in the Fragments 
>spec), they shouldn't be mixed together with the logical declarations. The 
>primary reason for this is that entities can be defined on a per-document 
>basis, while the logical declarations are defined on a per-document-class 
>basis. Mixing the two makes it difficult/impossible to define schemas that 
>are useful for an entire class of documents.

Nothing in XML Schema prevents definition of entities and notations
in an instance. The ability to define them in a schema and an instance 
is preserved from XML. Allowing this mixture is the status quo in XML.

I have been a proponent of allowing entity and notation definitions
in XML Schema, as witnessed in the SOX submission.

Having said that, Dan Connolly recently made a compelling argument
that entity definitions are not legitimate in an XML Schema
on the basis that XML requires all entity definitions to exist
in the DTD internal or external subsets. Since an XML Schema 
does not qualify as a subset in any sense, it is inappropriate
to provide for defining entities in an XML Schema. 

This issue will have to be played out in the Working Group.
>
>MINOR COMMENTS:
>==============
>Section 2.3 -- On 'types'
>What is gained by making the distinction between definitions and 
>declarations? In particular, this is reflected in the element tag names and 
>is unlikely to be understood by the unwashed masses (like me) who are 
>writing schemas.

With so many editors, it has been difficult at times to reach agreement
on the correct and precise use of terminology.

My $0.02 worth is that a 'declaration' binds a 'name' with a 'specification'
(or 'definition' if you like). Thus, an element type declaration in the 
abstract grammar looks like this:

	elementTypeDecl ::= NCName elementTypeSpec 
	elementTypeSpec ::= (archetypeRef | archetypeSpec) exportControl global?

I think that we should adopt this terminology consistently.
>
>Section 3.1 -- The Schema
>a) What is the relationship between schemaIdentity and schemaName and why 
>are they separate? My first guess was that schemaName gave the location of 
>a schema document (for example, for use in an import statement) and that 
>schemaIdentity gave the (single, unique, unchanging) ID of the schema. 
>However, this theory doesn't seem to be true, as schemaRef refers to 
>schemaName, not schemaIdentity.  I suspect that schemaIdentity should be 
>replaced by schemaName (or vice versa).

As I remember this, the schemaIdentity and the schemaName are both URIs.
The 'schemaIdentity' declares the 'name' of the current schema.
The 'schemaIdentity' property is spelled 'name' on the schema element.

I agree that the 'schemaIdentity' could be re-spelled 'schemaName'.

>
>b) Why are schemaIdentity and version separate? In my mind, a different 
>version of a schema should have a different identity -- I certainly don't 
>want to try to validate a document using version A of a schema against 
>version B of that schema.

Whilst I agree that version information can more usefully be considered
to be part of the schema's URI (URL or URN), there was a strong push
for distinguishing version information in its own attribute.
>
>c) The Unique Definition constraint says that the same NCName cannot be 
>used for two definitions or declarations of the same type. However, 
>unparsed entities and parsed general entities occupy a single symbol space 
>in the schema language. Is this correct?  I thought these had separate 
>symbol spaces, but I can't find anything in the XML spec that actually 
>states this.

Actually, both XML and SGML provide for a combined collection of names
for all instance entities. That is, unparsed entities and both forms of 
parsed entities share a single set of names.

>
>Section 3.2 -- The Document and its Root
>The ability to declare a root element type is useful. For example, an 
>application that reads a schema document might reasonably expect it to 
>start with a <schema> element and consider that document to be invalid 
>otherwise. I think this ability should be added as an option.

-- Neither XML nor SGML provides a means by which to identify
the root element in the DTD.

-- Both XML and SGML provide for declaring the root element in
the DOCTYPE declaration.

-- XML Schema provides no way to inform the application what the 
root element should be.

-- This is as it should be.

The root element of any XML document is evident through inspection.
Upon encountering any element, including the root, it is possible
to determine the namespace, and thus the schema definition, of that
element. There is no particular need to specify the expected root 
element beforehand.
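
To illustrate the inspection argument, here is a minimal Python sketch
using the standard library's xml.etree.ElementTree. The sample document
and the namespace URI http://example.org/po are made up for the
example; nothing here comes from the draft.

```python
import xml.etree.ElementTree as ET

def root_namespace_and_name(xml_text):
    """Return (namespace URI or None, local name) of the root element."""
    root = ET.fromstring(xml_text)
    # ElementTree spells a namespaced tag as "{uri}local".
    if root.tag.startswith("{"):
        uri, local = root.tag[1:].split("}", 1)
        return uri, local
    return None, root.tag

# Hypothetical instance document; the namespace URI is an assumption.
doc = '<po:order xmlns:po="http://example.org/po"><po:item/></po:order>'
ns, name = root_namespace_and_name(doc)
print(ns, name)
```

A processor could then use the namespace it finds this way as the key
for locating the governing schema, without being told the root in
advance.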
>
>Section 3.3 -- References to Schema Constructs
>a) What is gained by the ability to reference a schema by both its 
>abbreviation and its name? Except for showing off typing skills, are there 
>any good reasons to refer to a schema by its name? If not, remove this 
>ability.

Please allow me to turn the question around: What is gained by the ability
to reference a schema (or anything) by its abbreviation?

In 'Namespaces in XML', the prefix was introduced because not all of 
the characters that are valid in a URI are valid in an XML 'name'.
Therefore, we needed a shorthand (or macro) that could be expanded
to a full URI. 

In XML Schema, there is no such syntactic limitation to overcome.
A schema reference can more readily and precisely be expressed 
with the full URI.

Having said that, it is worth noting that there is a social reason
to allow the use of namespace prefix and URI binding, as well as
the prefix expansion protocol.
>
>b) Get rid of the schemaAbbrev attribute and use prefixed names as is done 
>with namespaces. For example, <elementTypeRef name="HTML:BLOCKQUOTE"/>. 
>This is much easier to read and more intuitive for people accustomed to 
>namespaces. The difference in processing cost is not significant.

Perhaps this is more intuitive for anyone who is now used to thinking
about namespaces in XML instances. But I defy you to explain the 
meaning of the following constructs in terms of 'Namespaces in XML':

	<elementTypeRef name="HTML:BLOCKQUOTE"/>
	<archetypeRef   name="HTML:BLOCKQUOTE"/>
	<attrGroupRef   name="HTML:BLOCKQUOTE"/>
	<modelGroupRef  name="HTML:BLOCKQUOTE"/>

According to 'Namespaces in XML', these are all the same name.
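
A small Python sketch of the point, assuming the usual prefix expansion
protocol (prefix plus binding gives a URI and local name pair). The
function expand_qname and the URI bound to HTML are hypothetical, and
default namespaces are ignored for brevity.

```python
def expand_qname(qname, bindings):
    """Expand 'prefix:local' to (namespace URI, local name)."""
    prefix, sep, local = qname.partition(":")
    if not sep:
        return None, qname  # unprefixed name; no namespace in this sketch
    return bindings[prefix], local

# Hypothetical binding; the URI is a placeholder.
bindings = {"HTML": "http://example.org/html"}

ref_kinds = ["elementTypeRef", "archetypeRef", "attrGroupRef", "modelGroupRef"]
expanded = {kind: expand_qname("HTML:BLOCKQUOTE", bindings) for kind in ref_kinds}

# All four references expand to one and the same (URI, local name) pair.
print(set(expanded.values()))
```

The expanded name alone cannot say whether an element type, archetype,
attribute group, or model group is meant; only the referencing element
carries that information.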

>
>Section 3.4.1 -- Datatype Definition
>a) Specialization of data types at point of use is too flexible and too 
>likely to cause confusion. Instead, require people to define and use new 
>data types. However, declaring default values at point of use should be 
>retained, as this applies to the use of the type and not the type itself.

That is good input.

>
>b) Why is fixed part of the data type qualification? This has nothing to do 
>with data types and should be part of the attribute or element type 
>constraint.

That is good input.
>
>c) What issues are there about aggregate data types that need to be 
>resolved?  An aggregate data type is simply an element content model.

There is another point of view that says that an aggregate datatype
is a fielded datatype: either one like NMTOKENS or IDREFS, which provides
for multiple values of the same type, or one like dateTime, which provides
for a segmented value space with specific delimiters and range checking
on the individual types that comprise the aggregate.

This debate will be settled in the WG.
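
The two flavours of fielded datatype can be sketched in Python as
follows. This is only an illustration: the NMTOKEN pattern is a
simplified ASCII subset, the date format stands in for a segmented
type such as dateTime, and both function names are hypothetical.

```python
import re

# Simplified NMTOKEN: ASCII name characters only (XML allows many more).
NMTOKEN = re.compile(r"[A-Za-z0-9._:-]+")

def parse_nmtokens(value):
    """List-style aggregate: whitespace-separated values, all of one type."""
    tokens = value.split()
    if not tokens or not all(NMTOKEN.fullmatch(t) for t in tokens):
        raise ValueError(f"not a valid NMTOKENS value: {value!r}")
    return tokens

def parse_date(value):
    """Segmented aggregate: fixed delimiters, each segment range-checked."""
    m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", value)
    if not m:
        raise ValueError(f"bad delimiters or segment widths: {value!r}")
    year, month, day = (int(g) for g in m.groups())
    # Coarse range check only; days are not checked against the month.
    if not 1 <= month <= 12 or not 1 <= day <= 31:
        raise ValueError(f"segment out of range: {value!r}")
    return year, month, day
```

Neither case needs element markup inside the value, which is what
separates this view from "an aggregate is an element content model".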
>
>Section 3.4.2 -- Archetype Definition
>What is a default element value and when is it applied? When the element is 
>empty? When it is missing? If the latter, and more than one instance of the 
>element type is legal, how many are created? My gut feeling is to delete 
>this.

Providing for default and fixed values for element content is 
useful in designing applications that 'fill themselves in'.
A document creation tool, such as an editor or invoice generator, 
can automatically insert content to satisfy local constraints
on a public schema.

Furthermore, this eliminates another of the differences between 
elements and attributes. When deciding between using an element
or an attribute, the primary consideration is whether or not the
value is allowed to contain element content, and whether or not
there may be more than one instance on/in an element. 

Other considerations, such as whether or not a schema author can specify 
a datatype, or a default or fixed value, should be eradicated
-- in my opinion.
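
To make the 'fill themselves in' idea above concrete, here is a
minimal Python sketch. It assumes that a default applies only to an
element that is present but empty, which is exactly the kind of rule
the spec would still have to settle. The invoice document, element
names, and default value are invented for the example.

```python
import xml.etree.ElementTree as ET

def fill_defaults(root, defaults):
    """Insert default text into elements that are present but empty.

    `defaults` maps element names to default content. Skipping missing
    elements is an assumption of this sketch, not a rule from the draft.
    """
    for elem in root.iter():
        is_empty = len(elem) == 0 and not (elem.text or "").strip()
        if elem.tag in defaults and is_empty:
            elem.text = defaults[elem.tag]
    return root

# Hypothetical invoice; names and default value are made up.
invoice = ET.fromstring("<invoice><currency/><total>10.00</total></invoice>")
fill_defaults(invoice, {"currency": "CAD"})
print(ET.tostring(invoice, encoding="unicode"))
```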
>
>Section 3.4.3 -- Attribute Declaration
>Why should there be a default attribute data type?  There isn't now.

For convenience. Specifying a datatype adds bytes to the data stream,
and may require repetitive manual insertion.
>
>Section 3.4.6 -- Mixed Content
>a) The use of a <mixed> element with no children to indicate PCDATA-only 
>content is not intuitive to most users and is likely to lead to confusion. 
>Either:
>   i) Add a <pcdata> element, or
>   ii) Require one or more elementTypeRef's under <mixed>. PCDATA-only 
>content is stated with <datatypeRef name="string"/>.
>I prefer (ii), as it means there is only one way to declare PCDATA-only 
>content.

It may not be intuitive to you, but it is consistent with XML.
	i)  I think that you can use <datatypeRef name='string'/>.
	ii) That would make it difficult to create mixed archetypes
	    that are later refined with their element mixtures.
>
>b) What is the purpose of the NOTE? That is, why is it important to be able 
>to declare PCDATA-only content without using the above datatypeRef?

I think that it was intended to aid understanding. If it did not,
we should probably wordsmith it until it does.
>
>Section 3.4.9 -- Element Type Declarations
>Locally-scoped element type names break XML 1.0 validity and are probably 
>not worth the confusion they will cause -- remember that most document 
>authors are not programmers and are not likely to understand scoping. I 
>suggest you delete them.

Preserving XML 1.0 validity is not a requirement.

The WG seems to be split on this question. I tend to agree with Ron,
but I would not lie down in the road.

>
>Section 4.1 -- Associating Instance Document Constructs with Corresponding 
>Schemas
>What is the relationship between schemaIdentity, schemaName, and the 
>namespace URI? My guess is that all should be the same, but this is never 
>stated.

The schemaIdentity, spelled <schema name="..." ... >, is the URI 
of the current schema. The schemaName="..." property is a URI
that is a reference to a schema. The schemaAbbrev property is
an 'NCName' that is a reference to an imported schemaName.
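
A rough Python sketch of how a processor might resolve such
references, keeping one abbreviation table per schema. The table
layout, the function resolve_schema_ref, and all URIs are assumptions
made for illustration only.

```python
# Each schema keeps its own abbreviation table, so the same abbreviation
# may mean different things in different schemas.
# All URIs and abbreviations below are hypothetical.
schemas = {
    "http://example.org/A": {"imports": {"x": "http://example.org/B"}},
    "http://example.org/B": {"imports": {"x": "http://example.org/C"}},
    "http://example.org/C": {"imports": {}},
}

def resolve_schema_ref(current_schema, schema_abbrev=None, schema_name=None):
    """Resolve a schemaRef, given by abbreviation or full name, to a URI."""
    imports = schemas[current_schema]["imports"]
    if schema_abbrev is not None:
        if schema_abbrev not in imports:
            raise KeyError(f"{schema_abbrev!r} is not imported by {current_schema}")
        return imports[schema_abbrev]
    if schema_name not in imports.values():
        raise KeyError(f"{schema_name!r} is not imported by {current_schema}")
    return schema_name
```

Because resolution always ends in a full schema name, the abbreviation
is purely local: "x" in schema A and "x" in schema B resolve to
different schemas without conflict.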
>
>Section 4.2 -- Exporting Schema Constructs
>What is the motivation for export control?  It doesn't seem applicable to 
>XML.  Unlike programming languages, where implementation details can be 
>hidden from the user, everything in an XML document is visible.  Saying 
>that I can see an archetype, attribute, content model, etc. but can't use 
>it is just plain silly.  It doesn't really mean I can't use the schema 
>object -- it just means that I have to cut and paste instead of using the 
>schema language's handy, built-in referencing mechanisms.

Several people have commented that this is a bit over-engineered.
>
>Section 4.6 -- Import Restrictions
>a) The note asks whether imported definitions are re-exported and states 
>that they are not due to difficulties in managing abbreviation 
>associations. I don't understand this -- such difficulties must already be 
>handled. For example, suppose schema A imports element b from schema B, and 
>element b includes element c imported from schema C. If schema B uses the 
>same abbreviation for schema C that schema A uses for schema B, the 
>processor must resolve this today. This is not difficult, as the processor 
>maintains abbreviation lists on a per-schema basis and does its processing 
>based on schemaIdentity/Name, not the abbreviation. Thus, imported 
>definitions should be re-exported.

Several people have commented that this is a bit over-engineered.
>
>b) The note in section 4.7 asks whether import implicitly imports features 
>not explicitly imported or only imports such features when needed by 
>explicitly imported features. The latter case is the correct one, as it 
>makes the schema author's intentions clear and forces them to import 
>exactly what they want. (It also saves memory in the processor, which only 
>needs to save those schema items needed by the importing schema, rather 
>than the entire imported schema. Whether the memory saved is significant 
>depends on the size of the imported schema and how much was imported.)

Good input.
>
>Section 4.7 -- Schema Inclusion
>a) Does schema inclusion solve any problems that can't be solved with 
>external entities? If not, delete it.

First, we use XML instance syntax for all constructs, thereby ensuring
that we can build tools with standard XML components such as the DOM,
the Information Set, and even SAX.

	<include schemaName='myOtherSchema'/>

Second, it takes one step, where the external-entity approach takes
two (declare, then reference):

	<!ENTITY myOtherSchema SYSTEM 'myOtherSchema.xsd'>
	&myOtherSchema;

>
>b) If includes are kept, it *must* be an error if identically named items 
>are encountered twice. Using the first definition is a bad practice and 
>open to abuse. (A reasonable compromise would allow multiple, identical 
>definitions.)

I agree that it must be an error. There is an SVC that expresses 
this rule.
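
The rule can be sketched in Python as a merge that refuses duplicates.
The representation of definitions as a mapping keyed by (kind, name)
and the names below are hypothetical; the draft states the constraint
but not an implementation.

```python
class DuplicateDefinitionError(Exception):
    """Raised when an included schema redefines an existing name."""

def include_definitions(current, included):
    """Merge `included` definitions into `current`, rejecting duplicates.

    Both arguments map (kind, name) pairs to definitions; the tuple key
    stands in for one symbol space per kind of construct.
    """
    merged = dict(current)
    for key, definition in included.items():
        if key in merged:
            kind, name = key
            raise DuplicateDefinitionError(f"{kind} {name!r} is defined twice")
        merged[key] = definition
    return merged
```

Ron's suggested compromise (allowing multiple identical definitions)
would amount to comparing the two definitions before raising.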
>
>Section 4.8 -- Access to Schemata
>Basing a schema's location on its name (namespace URI, schemaName, or 
>schemaIdentity) is a bad idea. A generic schema processor, such as a 
>generic validation module or a schema-driven editor, has only one realistic 
>choice when it comes to locating the schema document, and that is to hope 
>that the URI is a URL and try to resolve it. (I don't believe that schema 
>name servers are going to appear any time soon, if ever.)  Unfortunately, 
>this forms a one-to-one relationship between schema names and locations, 
>which precludes multiple copies of the schema. It also means that, in most 
>cases, the processor must be connected to the Web.

Reality is in the eye of the beholder. For anyone who chooses to use
URLs exclusively, and in agreement with his/her document trading partners,
your approach will work just fine. For anyone who chooses to use URNs
in agreement with his/her trading partners, that will work too. The market
can decide whether the URL or more general URI approach will prevail.
>
>One possible solution is to separate location information from the schema's 
>name. This is needed in import and include statements within the schema and 
>also in whatever mechanism (not namespace declarations) is used to 
>associate schemas with instance documents. The location information can 
>still be a URI, and the mechanism by which the URI is resolved to an actual 
>location can still be processor-specific, but this does allow generic 
>processors to simply resolve the URI as a URL or fail.

The separation of schemaName from schemaLocation is application- and
processing environment-specific. Whether through URN lookup mechanisms,
or OASIS catalogs, or URL-redirection and content negotiation, the
schemaLocation must be handled outside of the schema -- in my opinion.
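
As one possible shape for such an external mechanism, here is a small
Python sketch. The catalog is a plain dictionary standing in for an
OASIS catalog or URN lookup; the function locate_schema, the URN, and
the file path are all hypothetical.

```python
def locate_schema(schema_name, catalog):
    """Map a schema name (a URI) to a location, outside the schema itself.

    If the catalog has no entry, fall back to treating the name as a
    URL; names that are not URLs cannot be located by this sketch.
    """
    if schema_name in catalog:
        return catalog[schema_name]
    if schema_name.startswith(("http://", "https://", "file://")):
        return schema_name
    raise LookupError(f"cannot locate schema {schema_name!r}")

# Hypothetical catalog entry.
catalog = {"urn:example:schemas:po": "/usr/local/schemas/po.xsd"}
print(locate_schema("urn:example:schemas:po", catalog))
```

The schema itself only ever mentions the name; trading partners who
prefer URLs simply run with an empty catalog.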
>
>Section 5 -- Documenting schemas
>In considering documentation elements, please consider the following:
>a) Display names, which can be different from item names. For example, I 
>want an element type of SalesOrder, but I want this to result in a form 
>name of Sales Order.
>b) Support for multiple languages (English, French, etc.)

Thanks for the good input.
>
>Section 6.1 -- Schema Validity
>a) The sentence after the definition of schema-ready for documents states 
>that a document is schema-ready even if it has no namespace declarations. 
>How? The definition of schema-ready for documents states that the document 
>is schema-ready if all of its elements are schema-ready, and the definition 
>of schema-ready for elements states that an element is schema-ready if any 
>of its namespace declarations resolves to a schema. This doesn't seem to 
>cover the case where none of the namespace declarations resolves to a 
>schema or there are no relevant namespace declarations. I think these cases 
>need to be added explicitly to the definition of schema-ready for elements.

Henry can comment on this.
>
>b) The definition of schema-valid allows partial validation (in the XML 1.0 
>sense) of documents. While this is undoubtedly useful, many (most?) 
>applications will want full validity (in the XML 1.0 sense). I think you 
>need a definition such as totally-schema-valid, which is the same as 
>schema-valid except that all elements must be schema-governed, and a way 
>for applications to request this.

Thanks for the question. Henry can answer this.
>
>c) In the description of how DTDs and schemas interact, please explain what 
>happens when the DTD and schema conflict. This is (perhaps unintentionally) 
>covered in part by the first bullet, which allows for this case.  I think 
>it is OK to simply say that this is the document author's problem.

Thanks for the question. Henry can answer this.
>
>Section 6.2 -- Responsibilities of Schema-aware processors
>Why is item 6 (exposing the combined information set) required for 
>conformance?

Thanks for the question. Henry can answer this.
>
>TYPOS, ERRORS, NITPICKING, ETC.
>==============================
>Section 1.3 -- Relationship to Other Work
>Consider mentioning the Fragment spec, which wants to use schema syntax 
>(including entity definitions) to represent DTDs in line.

Thanks for the input.
>
>Section 2.4 -- Schemas and their component parts
>The table states that archetype refinements are named.  I assume this is an 
>error, as I see no way to name a refinement (as distinct from an 
>archetype).  If there is a way to name a refinement, it should share the 
>same symbol space with archetypes.

Thanks for the input.
>
>Section 3.3 -- References to Schema Constructs
>The Consistent Import constraint states, "A schemaAbbrev or schemaName in a 
>schemaRef must be declared in an Schema Import of the current schema, ..." 
> I think it would be clearer to say "... in the current schema..."  "Of" 
>implies that the current schema is being imported somewhere else.

Thanks for the input.
>
>Section 3.4.1 -- Datatype Definition
>The second paragraph states that "datatype[s constrain]...the character 
>data contents of elements". This should be more specific and state that 
>they constrain the character data content of elements that can only contain 
>character data. Clearly, data types do not constrain character data when 
>character and element content is present.

Thanks for the input.
>
>Section 3.4.2 -- Archetype Definition
>Archetypes are really nice, but the name is obscure. How about "base type", 
>"base element type", or "abstract type" instead?

Actually, I kinda like 'archetype' since I came up with it.
As you can see, 'basetype' is taken in the datatype spec.
I think that 'baseElementType' is a bit long, but we had not
considered it. And 'abstractType' was considered to have broader
implications for datatypes and element types.

The term 'archetype' is inherited from HyTime 'Architectural Forms'.
>
>Section 3.4.3 -- Attribute Declaration
>In production [24], required should be followed by a "?". If it is 
>understood to be a choice of required/not required, it needs a production 
>of its own.

Thanks for the good catch.
>
>Section 3.4.4 -- Attribute Group Definition
>Productions [26], [27], and [29] do not match the DTD in appendix B:
>a) [26] should be: attrGroupSpec ::= (attrDecl | attrGroupRef)+ 
>exportControl
>b) [27] should be: attrGroupRef ::= attrGroupName
>c) [29] should be deleted.  (What is its purpose, anyway?)

[26] and [27]: I agree.
[29]: TBD. Thanks for the input.
>
>Section 3.4.7 -- Element-Only Content
>The example states that the default of maxOccur is 1. In fact, maxOccur has 
>no default.

Right. It is understood to be '1' when not specified.
>
>Section 3.4.9 -- Element Type Declarations
>If locally-scoped element type names are retained, two changes are needed:
>
>a) At the start of the fifth paragraph, change "An elementTypeDecl may also 
>appear within a modelElt..." to "... within a modelElt or mixed..."

Thanks.
>
>b) Clarify section 2.5 with respect to the symbol spaces of attributes and 
>element types; at the very least, simply add cross-references to the 
>relevant explanations (sections 3.4.3 and 3.4.9).

Thanks.
>
>Section 3.6 -- Entities and Notations
>Why are notations included with entities?  Entities are a physical 
>construct and notations are a logical construct.

Notations are (mostly) used with external unparsed entities.
While notations may also be used to specify the content of
an element (as in XML and SGML), they are hardly ever used 
that way.
>
>Section 3.6.1 & .2 -- Internal/External Parsed Entity Declaration
>a) Change "internal/external parsed entity" to "internal/external parsed 
>general entity" to make it clear you are defining general entities and not 
>parameter entities.

Hmmm. Maybe. Thanks for the input.
>
>b) In the example in 3.6.1, change "... in instances of the containing 
>schema..." to "in documents that use the schema..."

Hmmm. Maybe. Thanks for the input.
>
>Section 4.1 -- Associating Instance Document Constructs with Corresponding 
>Schemas
>In the last sentence of the example, "content model" should be "archetype".
>
>Throughout Entire Specification
>a) There are numerous grammatical errors in the use of "a" and "an" -- 
>admittedly minor, but annoying.
>
>b) Any chance that "datatype" could be made two words again?
>

Thanks for your valuable input.

----------------------------------------------------------
Murray Maloney, Esq.          Phone: (905) 509-9120
Muzmo Communication Inc.      Fax:   (905) 509-8637
671 Cowan Circle              Email: murray@muzmo.com
Pickering, Ontario            Web:   http://www.muzmo.com
Canada, L1W 3K6

Received on Tuesday, 8 June 1999 14:26:27 UTC