RE: Comments on XML Schemas: Structures (long) from Ronald Bourret on 1999-06-09 (www-xml-schema-comments@w3.org from April to June 1999)

From: Ronald Bourret <rbourret@ito.tu-darmstadt.de>
Date: Wed, 9 Jun 1999 11:53:52 +0200
To: "'Murray Maloney'" <murray@muzmo.com>
Cc: "'www-xml-schema-comments@w3.org'" <www-xml-schema-comments@w3.org>
Message-ID: <01BEB26E.BF4920A0@grappa.ito.tu-darmstadt.de>

Murray Maloney wrote:

> >Section 3.6 -- Entities and Notations
> >I strongly suggest that you split entity declarations into a separate
> >language, just as data types are a separate language.
>
> Actually, datatypes are written in a separate spec, but they are
> an integral part of the schema language.

True.

> Nothing in XML Schema prevents definition of entities and notations
> in an instance. The ability to define them in a schema and an instance
> is preserved from XML. Allowing this mixture is the status quo in XML.

The status quo (which is not sacred, as things such as nearly well-formed 
XML show) is broken here and should be fixed.  The physical layout of an 
XML document has no more to do with the logical schema to which that 
document conforms than the physical layout of database data on a disk has 
to do with the logical tables and columns into which that data is arranged. 
DTDs mix these concepts together and schemas are our best chance for 
separating them.

> >Section 3.1 -- The Schema
> >c) The Unique Definition constraint says that the same NCName cannot be
> >used for two definitions or declarations of the same type. However,
> >unparsed entities and parsed general entities occupy a single symbol space
> >in the schema language. Is this correct?  I thought these had separate
> >symbol spaces, but I can't find anything in the XML spec that actually
> >states this.
>
> Actually, both XML and SGML provide for a combined collection of names
> for all instance entities. That is, unparsed entities and both forms of
> parsed entities share a single set of names.

This is not true for parameter and general entities. To quote the fifth 
paragraph of section 4 of the XML spec, "Furthermore, they occupy different 
namespaces; a parameter entity and a general entity with the same name are 
two distinct entities." But you are correct that unparsed entities share 
the same namespace as parsed general entities.  I finally noticed that
the second sentence in 4.2.2 calls such an entity a "general unparsed
entity," which places it in the general entity namespace.
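
To make the two symbol spaces concrete, here is a minimal sketch using Python's
expat binding. The document and the names "shared", "logo", and "gif" are
invented for illustration. It declares a parameter entity and a general entity
with the same name, plus one unparsed entity, and groups what the parser
reports.

```python
import xml.parsers.expat

# Hypothetical document: entity and notation names are made up for this sketch.
DOC = b"""<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ENTITY % shared "parameter entity text">
  <!ENTITY shared "general entity text">
  <!NOTATION gif SYSTEM "image/gif">
  <!ENTITY logo SYSTEM "logo.gif" NDATA gif>
]>
<doc/>"""


def collect_entity_spaces(document):
    """Group declared entity names by the symbol space they occupy."""
    spaces = {"parameter": {}, "general": {}}

    def on_entity_decl(name, is_parameter, value, base,
                       system_id, public_id, notation):
        # Parameter and general entities are reported separately by the parser.
        space = "parameter" if is_parameter else "general"
        # An entity declared with NDATA carries a notation name: it is unparsed.
        kind = "unparsed" if notation is not None else "parsed"
        spaces[space][name] = kind

    parser = xml.parsers.expat.ParserCreate()
    parser.EntityDeclHandler = on_entity_decl
    parser.Parse(document, True)
    return spaces


spaces = collect_entity_spaces(DOC)
print(spaces)
```

The name "shared" appears once in each space without conflict, while the
unparsed entity "logo" lands in the same space as the parsed general entity.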

> >Section 3.2 -- The Document and its Root
> >The ability to declare a root element type is useful. For example, an
> >application that reads a schema document might reasonably expect it to
> >start with a <schema> element and consider that document to be invalid
> >otherwise. I think this ability should be added as an option.
>
> -- In neither XML nor SGML is there a means by which to identify
> the root element in the DTD.
>
> -- Both XML and SGML provide for declaring the root element in
> the DOCTYPE declaration.
>
> -- XML Schema provides no way to inform the application what the
> root element should be.
>
> -- This is as it should be.
>
> The root element of any XML document is evident through inspection.
> Upon encountering any element, including the root, it is possible
> to determine the namespace, and thus the schema definition, of that
> element. There is no particular need to specify the expected root
> element beforehand.

I disagree that there is no particular need to specify the expected root 
beforehand. In fact, many applications expect a particular root.  For 
example, if I write a module that reads a schema and validates an instance 
document against that schema, that module clearly expects the root element 
of the schema to be <schema> and will throw an error if it is not.

Thus, the root element type is part of the expected structure of the 
document and part of its logical schema. Put another way, we can add an 
optional root element type declaration to schemas and move that part of 
validation to a generic processor, or we can leave it out and force 
applications to validate the root element type themselves, just like they 
do with data types today.
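
This is the kind of check every application has to write for itself today. A
minimal sketch, assuming the application simply compares the root element type
against the one it expects; check_root is a hypothetical helper, not part of
any schema processor.

```python
import xml.etree.ElementTree as ET


def check_root(document, expected_root):
    """Raise ValueError unless the document's root element type is expected_root."""
    root = ET.fromstring(document)
    # ElementTree reports namespaced tags as "{uri}local"; keep the local part.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name != expected_root:
        raise ValueError(
            f"expected root <{expected_root}>, found <{local_name}>"
        )
    return root


# A schema-reading module would call this before doing anything else.
root = check_root('<schema name="http://foo"/>', "schema")
```

With an optional root element type declaration in the schema, a generic
processor could perform this test and the application code above would go away.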

> >Section 3.3 -- References to Schema Constructs
> >a) What is gained by the ability to reference a schema by both its
> >abbreviation and its name? Except for showing off typing skills, are there
> >any good reasons to refer to a schema by its name? If not, remove this
> >ability.
>
> Please allow me to turn the question around: What is gained by the ability
> to reference a schema (or anything) by its abbreviation?
>
> In 'Namespaces in XML', the prefix was introduced because not all of
> the characters that are valid in a URI are valid in an XML 'name'.
> Therefore, we needed a shorthand (or macro) that could be expanded
> to a full URI.
>
> In XML Schema, there is no such syntactic limitation to overcome.
> A schema reference can more readily and precisely be expressed
> with the full URI.
>
> Having said that, it is worth noting that there is a social reason
> to allow the use of namespace prefix and URI binding, as well as
> the prefix expansion protocol.

The social reason is precisely my point. This is entirely a user interface 
issue. The abbreviation-plus-colon-plus-name is easier to read, easier to 
write, and provides a familiar entry point for people who know namespaces. 
Of the following three choices, I find the third by far the easiest to read 
and write:

   i) <elementTypeRef name="bar" schemaAbbrev="foo">
   ii) <elementTypeRef name="bar" schemaName="http://foo">
   iii) <elementTypeRef name="foo:bar">

Regardless of which way is chosen, there should be only one.  I see no 
compelling reason to provide multiple ways to do something as simple as 
this.

> Perhaps this is more intuitive for anyone who is now used to thinking
> about namespaces in XML instances. But I defy you to explain the
> meaning of the following constructs in terms of 'Namespaces in XML':
>
> 	<elementTypeRef name="HTML:BLOCKQUOTE"/>
> 	<archetypeRef   name="HTML:BLOCKQUOTE"/>
> 	<attrGroupRef   name="HTML:BLOCKQUOTE"/>
> 	<modelGroupRef  name="HTML:BLOCKQUOTE"/>
>
> According to 'Namespaces in XML', these are all the same name.

These are definitely the same according to 'Namespaces in XML', as 
namespaces do not apply to element or attribute values, but that is beside 
the point. What I am suggesting is that schemas borrow a familiar, 
easy-to-use mechanism and apply it to a similar situation -- that of naming 
values that will be used as markup in instance documents.
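
As a rough sketch of what I mean, a schema processor could expand a prefixed
reference against the abbreviations declared by its import statements. The
function name, the shape of the import table, and the treatment of unprefixed
names as belonging to the current schema are all assumptions made for this
example.

```python
def resolve_ref(name, imports):
    """Expand 'abbrev:local' into a (schema_name, local_name) pair.

    `imports` maps each schemaAbbrev to its schemaName. An unprefixed name
    is assumed to refer to the current schema and yields (None, name).
    """
    abbrev, colon, local = name.partition(":")
    if not colon:
        return (None, name)
    if abbrev not in imports:
        raise ValueError(f"undeclared schema abbreviation: {abbrev!r}")
    return (imports[abbrev], local)


# Mirrors choice (iii) above: name="foo:bar" with foo bound to http://foo.
imports = {"foo": "http://foo"}
print(resolve_ref("foo:bar", imports))
```

The kind of construct being named (element type, archetype, attribute group,
model group) comes from the referencing element, so the same expanded pair can
be looked up in a different symbol space for each of the four references
quoted above.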

> >c) What issues are there about aggregate data types that need to be
> >resolved?  An aggregate data type is simply an element content model.
>
> There is another point of view that says that an aggregate datatype
> is a fielded datatype like NMTOKENS or IDREFS, that provides
> for multiple values of the same type, or like dateTime, that provides
> for a segmented value space with specific delimiters and range checking
> on the individual types that comprise the aggregate.

Fair enough -- I hadn't thought about that.

> >Section 3.4.2 -- Archetype Definition
> Providing for default and fixed values for element content is
> useful in designing applications that 'fill themselves in'.
> A document creation tool, such as an editor or invoice generator,
> can automatically insert content to satisfy local constraints
> on a public schema.
>
> Furthermore, this eliminates another of the differences between
> elements and attributes. When deciding between using an element
> or an attribute, the primary consideration is whether or not the
> value is allowed to contain element content, and whether or not
> there may be more than one instance on/in an element.

Good point.  So when is the element default applied?  When the element 
isn't there or when it is empty?
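
The two readings lead to different documents, which is why the question
matters. A small sketch of both, with a hypothetical apply_default helper and
invented element names; neither policy is taken from the spec.

```python
import xml.etree.ElementTree as ET


def apply_default(parent, tag, default, policy):
    """Fill in a default for child element `tag` under one of two policies.

    policy == "absent": apply the default only when the child is missing.
    policy == "empty":  apply the default only when the child is present
                        but has no text and no child elements.
    """
    child = parent.find(tag)
    if child is None:
        if policy == "absent":
            # Appended at the end; content model ordering is ignored here.
            ET.SubElement(parent, tag).text = default
    elif len(child) == 0 and not (child.text or "").strip():
        if policy == "empty":
            child.text = default
    return parent


with_empty = ET.fromstring("<invoice><currency/></invoice>")
without = ET.fromstring("<invoice/>")
apply_default(with_empty, "currency", "USD", "empty")
apply_default(without, "currency", "USD", "empty")
print(ET.tostring(with_empty), ET.tostring(without))
```

Under the "empty" policy the second invoice stays without a currency element,
while under the "absent" policy it would gain one and the first would be left
alone.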

> >Section 3.4.6 -- Mixed Content
> >a) The use of a <mixed> element with no children to indicate PCDATA-only 
> >content is not intuitive to most users and is likely to lead to confusion.
> >Either:
> >   i) Add a <pcdata> element, or
> >   ii) Require one or more elementTypeRef's under <mixed>. PCDATA-only
> >content is stated with <datatypeRef name="string"/>.
> >I prefer (ii), as it means there is only one way to declare PCDATA-only
> >content.
>
> It may not be intuitive to you, but it is consistent with XML.
> 	i)  I think that you can <datatypeRef name='string'/>
> 	ii) That would make it difficult to create mixed archetypes
> 	    that are later refined with their element mixtures.

Point (ii) is a good one that I hadn't thought of. I'm still unhappy about
there being multiple ways to say the same simple thing, but I don't see a
way around point (ii) otherwise.

> >Section 3.4.9 -- Element Type Declarations
> >Locally-scoped element type names break XML 1.0 validity and are probably
> >not worth the confusion they will cause -- remember that most document
> >authors are not programmers and are not likely to understand scoping. I
> >suggest you delete them.
>
> Preserving XML 1.0 validity is not a requirement.
>
> The WG seems to split on this question. I tend to agree with Ron,
> but I would not lay down in the road.

Fair enough about 1.0 validity -- that was really a bogus argument on my 
part.  My main objection is that I don't think this is worth the confusion 
it will cause.

> >Section 4.1 -- Associating Instance Document Constructs with Corresponding
> >Schemas
> >What is the relationship between schemaIdentity, schemaName, and the
> >namespace URI? My guess is that all should be the same, but this is never
> >stated.
>
> The schemaIdentity, spelled <schema name="..." ... >, is the URI
> of the current schema. The schemaName="..." property is a URI
> that is a reference to a schema. The schemaAbbrev property is
> an 'NCName' that is a reference to an imported schemaName.

Unfortunately, the spec does not state how schemaName refers to a schema. 
The reasonable assumption is that the schemaName in an import statement 
must be identical to the schemaIdentity in the imported schema. However, 
there is nothing in the spec to prevent the vastly-less-useful 
interpretation that schemaName is simply a URI that locally identifies the 
imported schema and that the same schema could be imported into different 
schemas using different schemaNames.  For example:

myLocalCopyOfFoo.xsd:
   <schema name="http://foo">...</schema>

yourLocalCopyOfFoo.xsd:
   <schema name="http://foo">...</schema>

Schema A:
   <import schemaAbbrev="foo" schemaName="myLocalCopyOfFoo.xsd"/>

Schema B:
   <import schemaAbbrev="foo" schemaName="yourLocalCopyOfFoo.xsd"/>

The same problem exists in the relationship between schemaIdentity and the 
URI used in namespace declarations.
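
Under the reasonable interpretation, a processor could enforce the match with
a check along these lines. This is only a sketch: the helper names are
hypothetical, and it assumes the imported schema has already been retrieved by
some means the spec does not define.

```python
import xml.etree.ElementTree as ET


def schema_identity(schema_document):
    """Return the name attribute of the root <schema> element."""
    return ET.fromstring(schema_document).get("name")


def import_matches(schema_name, imported_document):
    """True when the import's schemaName equals the imported schemaIdentity."""
    return schema_identity(imported_document) == schema_name


FOO = '<schema name="http://foo"></schema>'

# The strict reading accepts this import...
print(import_matches("http://foo", FOO))
# ...and rejects the local-file style used by Schema A and Schema B above.
print(import_matches("myLocalCopyOfFoo.xsd", FOO))
```

If the spec required this equality, the two local copies in the example could
only be imported under the single name http://foo.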

> >Section 4.7 -- Schema Inclusion
> >a) Does schema inclusion solve any problems that can't be solved with
> >external entities? If not, delete it.
>
> First, we use XML instance syntax for all constructs, thereby ensuring
> that we can build tools with standard XML components such as the DOM,
> the Information Set, and even SAX.
>
> 	<include schemaName='myOtherSchema'/>
>
> Second, it reduces the number of steps from two (declare/reference):
>
> 	<!ENTITY myOtherSchema 'myOtherSchema.xsd'>
> 	&myOtherSchema;

I'm still not convinced.  This forces schema processors to do work that 
would otherwise be done for them by the parser.  Can you give me an example 
showing how this benefits the user in a way that external entities cannot?
(Saving one line is not enough, nor is preserving physical document
structure with SAX, as that was never a goal of SAX and will probably be
solved with SAX2 anyway.)

> >Section 4.8 -- Access to Schemata
> The separation of schemaName from schemaLocation is application- and
> processing environment-specific. Whether through URN lookup mechanisms,
> or OASIS catalogs, or URL-redirection and content negotiation, the
> schemaLocation must be handled outside of the schema -- in my opinion.

I half-way agree with you. It is a much better theoretical design to just use
schema names to refer to schemas. Unfortunately, this means that I can't
ever write a schema that can be reliably used by all schema processors.
This is a major hole and could at least partially be closed by introducing
separate location information.

I'm very open to other suggestions, but leaving it to the marketplace means
that people will simply name their schemas with the URL of the schema
location, which does not scale and requires schema processors to be
connected to the Web.
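
One possible shape for separate location information, sketched under my own
assumptions: a processor consults a local catalog first and falls back to an
optional location hint supplied alongside the schema name. The function, the
catalog format, and the hint are all hypothetical.

```python
def locate_schema(schema_name, catalog, location_hint=None):
    """Resolve a schema name to a retrievable location.

    `catalog` is a local mapping from schema names to locations (for
    example, entries loaded from an OASIS-style catalog). `location_hint`
    stands in for separate location information carried with the reference.
    """
    if schema_name in catalog:
        return catalog[schema_name]      # local override wins
    if location_hint is not None:
        return location_hint             # portable fallback
    raise LookupError(f"no known location for schema {schema_name!r}")


catalog = {"http://foo": "/usr/local/schemas/foo.xsd"}
print(locate_schema("http://foo", catalog))
print(locate_schema("http://bar", catalog, "http://example.org/bar.xsd"))
```

The schema name stays a pure identifier, a processor without a catalog still
has something to try, and a disconnected processor can be configured locally.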

> >Section 3.4.7 -- Element-Only Content
> >The example states that the default of maxOccur is 1. In fact, maxOccur has
> >no default.
>
> Right. It is understood to be '1' when not specified.

Um, no.  It is understood to be infinity when not specified.  If it were
understood to be 1, there would be no way to duplicate '+' and '*', since
any explicit value of maxOccur would necessarily be less than infinity.
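
To spell out the correspondence with the DTD occurrence operators, here is a
small sketch that models an unspecified maxOccur as None and reads it as
unbounded, per the argument above. The function is hypothetical, and minOccur
is passed explicitly so that no default is assumed for it.

```python
def occurrence_operator(min_occur, max_occur=None):
    """Map (minOccur, maxOccur) to the equivalent DTD operator, if one exists.

    max_occur=None models an unspecified maxOccur, read here as unbounded.
    Returns "", "?", "+", or "*", or None when no single operator fits.
    """
    table = {
        (1, 1): "",       # exactly one
        (0, 1): "?",      # optional
        (1, None): "+",   # one or more: needs an unbounded maximum
        (0, None): "*",   # zero or more: needs an unbounded maximum
    }
    return table.get((min_occur, max_occur))


for bounds in [(1, 1), (0, 1), (1, None), (0, None), (2, 5)]:
    print(bounds, repr(occurrence_operator(*bounds)))
```

Only the two rows with an unbounded maximum yield '+' and '*', and no finite
maxOccur value can stand in for them.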

> >Section 3.6 -- Entities and Notations
> >Why are notations included with entities?  Entities are a physical
> >construct and notations are a logical construct.
>
> Notations are (mostly) used with external unparsed entities.
> While notations may also be used to specify the content of
> an element (qua XML and SGML), they are hardly ever used
> that way.

Perhaps not, but they nevertheless remain a logical construct, not a 
physical construct.

-- Ron
Received on Wednesday, 9 June 1999 05:56:49 UTC