[abstractComponentRefs-37] Schema components and schema documents

As you may or may not be aware, the Schema recommendation draws an 
important architectural distinction between what it calls schema 
components, and so-called schema documents.  In private communication, Tim 
Bray suggested that the TAG might welcome a bit of explanation of this 
distinction, so here's an attempt.  I'm not advocating anything here, 
merely explaining how the recommendation works, in the hopes of informing 
your deliberations on the "ComponentRefs" issue.  Also, I'm writing for 
myself, not for the Schema WG.

Schema Components and Schema Documents
--------------------------------------

Schema is fundamentally defined at what's called the "component" level[1]. 
Components are abstractions, much like Infoset information items.  They 
tell you the information you need to know, not the form in which it's 
represented.  So, for example, in order to have a declaration for a 
derived simple type you need to know the name of the new type, the base 
type from which it's derived, which facets are changed (e.g. 
maxInclusive), whether the new type is "final", and so on.  The core of 
the schema recommendation doesn't require you to put that information in 
any particular form.  For example, it might live in memory behind some 
"createSimpleType" API, perhaps having been dynamically created by some 
database.  In any case, the schema recommendation describes the result of 
a validation, regardless of the form in which the type information is 
stored.
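
To make the abstraction concrete, here is a minimal Python sketch of a 
component that lives only in memory.  Everything in it is an assumption 
made for illustration:  the names SimpleTypeComponent, create_simple_type 
and is_valid are hypothetical, "final" is reduced to a boolean, and only 
two integer facets are modelled.  It is not an API defined by the 
recommendation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SimpleTypeComponent:
    """Hypothetical in-memory stand-in for a simple type definition."""
    name: str
    target_namespace: str
    base: Optional["SimpleTypeComponent"] = None
    facets: Dict[str, int] = field(default_factory=dict)
    final: bool = False  # simplification: the real property is a set


def create_simple_type(name, target_namespace, base=None, facets=None,
                       final=False):
    """Build a component directly, with no schema document involved."""
    if base is not None and base.final:
        raise ValueError(f"{base.name} is final and cannot be derived from")
    # The new component carries its own copy of the base's facets.
    effective = dict(base.facets) if base is not None else {}
    effective.update(facets or {})
    return SimpleTypeComponent(name, target_namespace, base, effective, final)


def is_valid(value: int, simple_type: SimpleTypeComponent) -> bool:
    """Check an integer against the two facets this sketch models."""
    low = simple_type.facets.get("minInclusive")
    high = simple_type.facets.get("maxInclusive")
    return (low is None or value >= low) and (high is None or value <= high)


integer = create_simple_type("integer", "http://www.w3.org/2001/XMLSchema")
percent = create_simple_type(
    "percent", "http://example.org/ns", base=integer,
    facets={"minInclusive": 0, "maxInclusive": 100}, final=True)
```

Here is_valid(50, percent) is true and is_valid(101, percent) is false, 
and at no point did the type information pass through XML.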

What most people think of as a schema is what the recommendation calls a 
"schema document"[2].  A schema document sets out a normative XML 
representation for schema information.  Note, however, that a component 
may draw on information from several documents.  For example, you might 
create a type in one namespace, and I might derive a type in another 
namespace using yours as a base.  Since each schema document declares 
components for only one target namespace, the derived type and the base 
are necessarily set out in different schema documents.  Note that the 
derived component includes information from the base:  once derived it 
stands on its own, and has copies of some information from the base.  Thus you cannot in 
general put together a schema component by reading a single schema 
document.  Furthermore, it is quite possible for a component declared in a 
schema document to derive from or be the base for or otherwise use another 
that is defined through some non-document means.  For example, an HTML 
editor could build into a special validator the definitions for the HTML 
namespace, but could allow schema documents to build content models that 
use those HTML components. 
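
The point about needing several sources can be sketched in Python.  The 
two schema documents, their example.org namespaces, and the helper names 
read_declarations and assemble below are all invented for illustration, 
and the "assembly" handles only integer facets on simple type 
restrictions.  It is a toy, not what a conforming processor does.

```python
import io
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

YOUR_DOC = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/yours">
  <xs:simpleType name="score">
    <xs:restriction base="xs:integer">
      <xs:minInclusive value="0"/>
      <xs:maxInclusive value="100"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
"""

MY_DOC = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:y="http://example.org/yours"
           targetNamespace="http://example.org/mine">
  <xs:import namespace="http://example.org/yours"/>
  <xs:simpleType name="passingScore">
    <xs:restriction base="y:score">
      <xs:minInclusive value="60"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
"""


def read_declarations(doc_text):
    """Return {(namespace, name): (base_qname, own_facets)} for one document."""
    prefixes, root = {}, None
    source = io.BytesIO(doc_text.encode("utf-8"))
    for event, item in ET.iterparse(source, events=("start-ns", "end")):
        if event == "start-ns":
            prefix, uri = item
            prefixes[prefix] = uri
        else:
            root = item  # the last "end" event is the document element
    target = root.get("targetNamespace")
    found = {}
    for simple_type in root.findall(f"{{{XS}}}simpleType"):
        restriction = simple_type.find(f"{{{XS}}}restriction")
        prefix, _, local = restriction.get("base").rpartition(":")
        facets = {child.tag.split("}")[1]: int(child.get("value"))
                  for child in restriction}
        found[(target, simple_type.get("name"))] = (
            (prefixes[prefix], local), facets)
    return found


def assemble(qname, declarations):
    """Effective facets: those copied from the base, overridden by its own."""
    if qname[0] == XS:
        return {}  # built-ins come from the processor, not from a document
    base, own = declarations[qname]
    effective = dict(assemble(base, declarations))
    effective.update(own)
    return effective


mine = read_declarations(MY_DOC)
yours = read_declarations(YOUR_DOC)
passing = ("http://example.org/mine", "passingScore")
component = assemble(passing, {**yours, **mine})
```

With both documents available, component comes out as minInclusive 60 
and maxInclusive 100.  Calling assemble(passing, mine) with only my 
document fails with a KeyError, because the base it names is set out 
elsewhere.  Note also that xs:integer never appears in either document:  
the sketch treats it as something the processor simply knows, much like 
the built-in HTML definitions in the example above.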

In summary, schema documents are the most common way of setting out a 
schema, but not the only one, and you tend to need multiple documents to 
pull together a component.  In general, no single schema document can 
represent a component involving multiple namespaces.  As 
with synthetic infosets, you can create perfectly usable schemas in 
memory with some API, or in a database, without using the <schema> XML 
document form.  All that's required is that your processor understand the 
form in which you have stored the various definitions and declarations. Of 
course, the most commonly available processors read schema documents in 
the usual XML form, and we call out a level of conformance for those 
processors that do[3]. 

Identifiers for Declarations and Definitions
--------------------------------------------

Regarding the TAG issue on references to schema definitions and 
declarations:  it is coherent to consider identifiers for the markup in a 
schema document, or for the components in a schema, but the abstractions 
identified are surely very different.  The actual "type" that you've 
derived cannot be determined unambiguously from the document in which 
the derivation is set out.  For example, if someone fixed a bug in the 
base type, the derived type would change. 
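
A small Python sketch of that last example, with invented facet values 
and a hypothetical effective_facets helper (a simplification in which a 
derived type just overrides the facets it restates):

```python
def effective_facets(derivation, base_facets):
    """Combine an unchanged derivation with whichever base is in force."""
    merged = dict(base_facets)
    merged.update(derivation["facets"])
    return merged


# The text of the derivation never changes...
derivation = {"name": "passingScore", "base": "score",
              "facets": {"minInclusive": 60}}

# ...but the base it refers to does (both versions are made up).
base_with_bug = {"minInclusive": 0, "maxInclusive": 1000}
base_fixed = {"minInclusive": 0, "maxInclusive": 100}

before = effective_facets(derivation, base_with_bug)
after = effective_facets(derivation, base_fixed)
```

The derivation is byte-for-byte the same in both cases, yet before and 
after are different types:  one admits 500 and the other does not.  An 
identifier for the markup would not distinguish them;  an identifier for 
the component would have to.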

I think it's fair to say that the schema workgroup has informally 
concluded that it's schema components for which identities are most 
urgently needed.  On the other hand, for the reasons discussed above, 
component definitions do not in general follow from individual schema 
documents;  you typically need to assemble (as a validator would) a 
self-consistent set of schema documents for the namespace(s) being 
validated in order to know what components you have.  Approaches that 
attempt to use namespaces as the basis for a component name tend not to 
handle the cases where information is drawn from multiple namespaces, or 
to deal with the possibility that multiple schema documents are out there 
describing the same namespace (perhaps due to bug fixes or whatever.) 

Accordingly, the schema WG is working hard to find the right ways 
of handling component identity.  Interestingly, I personally think the 
answer might well be to use some RDDL or similar document to identify 
collections of schema documents, and to serve as the basis for creating 
identifiers for the components (such as type declarations) represented by 
those documents in combination.  If you are interested, the Schema WG has 
a subgroup wrestling with this...but I am not on it.  Check with Michael 
Sperberg-McQueen, our chair.

Is WSDL Similar to Schema?
--------------------------

Jonathan Marsh tells me that WSDL has indeed copied our component/document 
distinction, but my impression is that their workgroup has as yet put less 
energy into focusing on the subtleties that result.  It's also possible 
that WSDL declarations are somewhat more orthogonal than Schema's, so there 
may be a more direct mapping between lexical and conceptual forms (i.e., 
each component may be closer to being fully set out in a single 
document).  I'm not familiar with the sorts of importation and 
derivation across namespaces that they allow.

Anyway, I hope this is helpful to the TAG in its consideration of the 
abstractComponentRefs issue.

Noah

[1] http://www.w3.org/TR/xmlschema-1/#key-component
[2] http://www.w3.org/TR/xmlschema-1/#key-schemaDoc
[3] http://www.w3.org/TR/xmlschema-1/#key-interchange

------------------------------------------------------------------
Noah Mendelsohn                              Voice: 1-617-693-4036
IBM Corporation                                Fax: 1-617-693-8676
One Rogers Street
Cambridge, MA 02142
------------------------------------------------------------------

Received on Friday, 11 April 2003 22:38:32 UTC