RE: Component-Based Schema Design from Mark Feblowitz on 2002-12-27 (xmlschema-dev@w3.org from December 2002)

From: Mark Feblowitz <mfeblowitz@frictionless.com>
Date: Fri, 27 Dec 2002 11:23:27 -0500
To: "'Roger L. Costello'" <costello@mitre.org>
Cc: "Xmlschema-Dev (E-mail)" <xmlschema-dev@w3.org>
Message-ID: <4DBDB4044ABED31183C000508BA0E97F040AC2D0@FCPOSTAL>

Roger -

This is a very compelling idea. It is similar to OAGIS "components"
(although more limited in size and scope). I'm certain that there are other
similar attempts to appropriately chunk schemas for manageability,
usability, etc. Another that comes to mind is the ebXML Core Components
work. Again, similar aims, but the chunks are larger in size and scope.
There's currently a project to "harmonize" OAGIS Components and ebXML Core
Components.

I understand the motivation for the chunks, having struggled multiple times
with large, inflexible, overly structured schemas. I'm just not sure that
the size and independence characteristics will support realistic schemas,
except for some subset of extremely simple objects.

Of course, such an approach would require innovations in parsing
technologies, since the loading and processing of what could be hundreds of
schemas for a reasonably sized XML document would be prohibitive. There are
a few standards out there that essentially have one schema file per chunk,
and they are notoriously slow to validate. Extra machinery such as a
schema repository or pre-assembly of the full collection of chunk schemas
would be required.

Another downside of this approach is the management of similar, derived
concepts. For concept A' to be derived from concept A, either the schema for
A' must be dependent on the schema for A, or the information content from A
must be replicated in A', and we all know how difficult it is to maintain
definitions that result from replication (especially those who've struggled
with derivation by restriction on any reasonable scale). This is another
area where tool support might help - for example, dependent schemas could
easily be used to generate independent schemas.
 

Mark


Mark Feblowitz
mfeblowitz@frictionless.com 
MarkFeblowitz@attbi.com
w: 617-715-7231
h: 781-721-2729
m: 781-789-5478

 -----Original Message-----
From: 	Roger L. Costello [mailto:costello@mitre.org] 
Sent:	Friday, December 27, 2002 10:57 AM
To:	xmlschema-dev@w3.org
Cc:	Costello,Roger L.
Subject:	Component-Based Schema Design


Hi Folks,

INTEROPERABILITY VIA "SCHEMA CHUNKS"

I have become convinced that the key to interoperability is to promote
the use of broadly adopted "schema chunks".  I would like to hear your
thoughts on how to design interoperable schema chunks.

DEFINITION OF "SCHEMA CHUNK"

First, let me start by defining what I mean by a "schema chunk".  I will
provide a more detailed definition later, but for now: 

   A schema chunk is a schema with a narrow, well-defined purpose.

   Example. A "position" schema chunk has a very narrow scope - it
   defines the format of position data: lat, lon, msrmt accuracy,
   and id.

PROPERTIES OF SCHEMA CHUNKS

A schema chunk has certain properties: 

(a) a unique identifier
(b) no dependencies (that is, the schema chunk is standalone)

Thus, a schema chunk represents a reusable component.

PARTIAL VALIDATION AND INTEROPERABILITY

A schema chunk should enable partial validation.  A colleague recently
helped me to realize the importance of partial validation of an instance
document, and the role of partial validation in interoperability. 

DEFINITION OF PARTIAL VALIDATION

Oftentimes you will receive an instance document and you need only a
portion of the data.  Thus, you would like to validate just that
portion, extract it, and process it.  

Here are a couple of examples where partial validation plays an
important role:

EXAMPLE - EXTRACT/PROCESS THE TARGET POSITION CHUNK

A pilot is handed a floppy containing a document that contains, among
other things, the position of a target to be bombed.  He inserts
the floppy into his on-board computer, which has a cached copy of the
position schema chunk.  The computer validates the position data,
extracts it, and loads the coordinates into the ordnance.  The other
information on the floppy is irrelevant, and couldn't be validated even
if desired since the pilot has no connection to a network.

EXAMPLE - PIPELINE PROCESSING OF DATA CHUNKS

Imagine a document that gets sent through a series of stages.  Each
stage acts like a filter, validating the data chunk that is pertinent to
that stage, extracting (removing) it, processing it, and then passing
the modified document downstream to the next stage.
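
A minimal sketch of such a pipeline, using only the Python standard
library. The namespaces, element names, and the make_stage/run_pipeline
helpers are hypothetical, made up for illustration. The standard library
has no XML Schema validator, so the point where each stage would validate
its chunk is only marked with a comment.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaces; one per chunk, as the message proposes.
POSITION_NS = "urn:example:position"
WEATHER_NS = "urn:example:weather"

DOCUMENT = f"""\
<mission xmlns:p="{POSITION_NS}" xmlns:w="{WEATHER_NS}">
  <p:position><p:lat>42.36</p:lat><p:lon>-71.06</p:lon></p:position>
  <w:weather><w:visibility>good</w:visibility></w:weather>
  <notes>free text that no stage claims</notes>
</mission>"""


def make_stage(namespace, handle):
    """Build a stage that removes and handles top-level chunks in `namespace`."""
    prefix = "{" + namespace + "}"

    def stage(root):
        for child in list(root):  # iterate over a copy: root is mutated below
            if child.tag.startswith(prefix):
                # A real stage would validate `child` against its chunk schema here.
                root.remove(child)
                handle(child)
        return root  # the modified document goes downstream

    return stage


def run_pipeline(xml_text, stages):
    """Parse the document and pass it through each stage in order."""
    root = ET.fromstring(xml_text)
    for stage in stages:
        root = stage(root)
    return root


seen = []
stages = [
    make_stage(POSITION_NS, lambda e: seen.append(("position", e[0].text, e[1].text))),
    make_stage(WEATHER_NS, lambda e: seen.append(("weather", e[0].text))),
]
remaining = run_pipeline(DOCUMENT, stages)
print(seen)
print([child.tag for child in remaining])
```

Each stage only looks at the namespace it owns, so stages can be added,
removed, or reordered without touching the others.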

HOW TO DO PARTIAL VALIDATION

You may ask: "How do I perform partial validation?"  Answer: In the
instance document don't specify schemaLocation.  Then, at validation
time you must supply namespace/schema-URL values.  To do partial
validation provide just the namespace/schema-URL pair of the component
that you are interested in validating.
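
The selection step described above might look roughly like this. It is
a sketch only: the namespaces, the position.xsd file name, and the
select_for_validation helper are assumptions for illustration. The code
picks out the subtrees that a schema validator would be handed; the
validation call itself is left out because the Python standard library
does not include an XML Schema validator.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """\
<mission xmlns:p="urn:example:position" xmlns:w="urn:example:weather">
  <p:position><p:lat>42.36</p:lat><p:lon>-71.06</p:lon></p:position>
  <w:weather><w:visibility>good</w:visibility></w:weather>
</mission>"""

# Supplied at validation time instead of via schemaLocation in the instance.
# Only the chunk of interest is listed; namespace and file name are hypothetical.
SCHEMA_LOCATIONS = {"urn:example:position": "position.xsd"}


def namespace_of(element):
    """Return the namespace URI from an ElementTree tag like '{uri}local'."""
    tag = element.tag
    return tag[1:tag.index("}")] if tag.startswith("{") else ""


def select_for_validation(xml_text, schema_locations):
    """Yield (element, schema_location) for the outermost elements
    whose namespace has a schema supplied."""

    def walk(element):
        location = schema_locations.get(namespace_of(element))
        if location is not None:
            yield element, location  # the whole subtree belongs to this chunk
            return
        for child in element:
            yield from walk(child)

    yield from walk(ET.fromstring(xml_text))


selected = [(el.tag, loc) for el, loc in select_for_validation(DOCUMENT, SCHEMA_LOCATIONS)]
print(selected)
```

Everything in a namespace that was not supplied (here the weather data
and the enclosing mission element) is simply skipped.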

DESIGNING A SCHEMA CHUNK

Before I unveil my ideas on how to design schema chunks, let's consider
the implications of what I have stated above:

NARROW, WELL-DEFINED PURPOSE

This implies that each schema chunk be small, i.e., contain a small
number of elements.

UNIQUE IDENTIFIER

A schema chunk is identified by its targetNamespace.  To give each chunk
a unique identifier implies that each chunk go into a different schema,
and each schema have a different targetNamespace.  That is, one chunk,
one schema.

DECOUPLED

Each chunk must be standalone, self-contained, with no dependencies on
other schemas.  This means no importing/including of other schemas.

PROPOSED SCHEMA CHUNK DESIGN

Below is my proposal on how to design schema chunks to promote
reusability and interoperability.

First, here's my expanded definition of "schema chunk":

- a globally declared element comprised of 5-10 in-lined child elements.
- a chunk represents a well-defined chunk of information. 
- a chunk has a unique identifier - the targetNamespace.  
- a chunk is broadly useable.  
- one schema, one chunk.  That is, a schema just defines one chunk.
- a chunk has no dependencies on other schemas 
   a. Use element declarations, simpleType definitions.
   b. Don't use derived types, substitutionGroups.
   c. Use the Russian Doll design.

So, here's my proposal of how a schema chunk should be designed:

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns="schema-chunk-id"
            targetNamespace="schema-chunk-id"
            elementFormDefault="qualified">
    <!-- The default namespace matches targetNamespace so that the
         unprefixed type references below resolve to this schema. -->
    <xsd:element name="chunk">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element name="child-element1" type="simpleType-1"/>
                <xsd:element name="child-element2" type="simpleType-2"/>
                <xsd:element name="child-element3" type="simpleType-3"/>
                <xsd:element name="child-element4" type="simpleType-4"/>
                <xsd:element name="child-element5" type="simpleType-5"/>
            </xsd:sequence>
        </xsd:complexType>
    </xsd:element>

    <xsd:simpleType name="simpleType-1">...</xsd:simpleType>
    <xsd:simpleType name="simpleType-2">...</xsd:simpleType>
    <xsd:simpleType name="simpleType-3">...</xsd:simpleType>
    <xsd:simpleType name="simpleType-4">...</xsd:simpleType>
    <xsd:simpleType name="simpleType-5">...</xsd:simpleType>

</xsd:schema>

Note that with this design:

a. It defines one chunk:

     <chunk>
         ...
     </chunk>

b. The chunk has a unique id - defined by the targetNamespace.

c. The data is strongly typed - a simpleType for each data item.

d. The chunk is small - just 5 data items.

e. The chunk is bounded - the child elements are in-lined, using the
Russian Doll design.

f. The chunk is standalone - all simpleTypes needed to define the chunk
are bundled in the schema.  Everything that is needed to use and
understand the schema is right there.  No need to look through a long
type hierarchy chain, no need to examine other schemas.
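
Several of these rules are mechanical enough to check with a small
script. The following is a rough sketch, not a complete checker: the
"position" schema (abbreviated to two child elements), its namespace,
and the chunk_rule_violations helper are hypothetical. It tests only for
a targetNamespace, for import/include/redefine, for exactly one global
element, for complexContent (used here as a sign of complex type
derivation), and for substitutionGroup.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"

# Hypothetical "position" chunk following the template above.
POSITION_CHUNK = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns="urn:example:position"
            targetNamespace="urn:example:position"
            elementFormDefault="qualified">
  <xsd:element name="position">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="lat" type="latitude"/>
        <xsd:element name="lon" type="longitude"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:simpleType name="latitude">
    <xsd:restriction base="xsd:decimal">
      <xsd:minInclusive value="-90"/><xsd:maxInclusive value="90"/>
    </xsd:restriction>
  </xsd:simpleType>
  <xsd:simpleType name="longitude">
    <xsd:restriction base="xsd:decimal">
      <xsd:minInclusive value="-180"/><xsd:maxInclusive value="180"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>"""


def chunk_rule_violations(xsd_text):
    """Return a list of ways the schema breaks the chunk rules (empty if none)."""
    root = ET.fromstring(xsd_text)
    problems = []
    if not root.get("targetNamespace"):
        problems.append("no targetNamespace")
    # Standalone: no references to other schema documents.
    for name in ("import", "include", "redefine"):
        if root.find(f"{{{XSD}}}{name}") is not None:
            problems.append(f"uses xsd:{name}")
    # One schema, one chunk: exactly one globally declared element.
    if len(root.findall(f"{{{XSD}}}element")) != 1:
        problems.append("must declare exactly one global element")
    # complexContent signals extension or restriction of a complex type.
    if root.find(f".//{{{XSD}}}complexContent") is not None:
        problems.append("derives a complex type")
    if any(e.get("substitutionGroup") for e in root.iter(f"{{{XSD}}}element")):
        problems.append("uses substitutionGroup")
    return problems


print(chunk_rule_violations(POSITION_CHUNK))
```

The simpleType definitions still use xsd:restriction on built-in types;
the check deliberately allows that, since it is how the per-item
simpleTypes are defined in the first place.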

GLUE SCHEMAS

I have become a believer in schema design using schema chunks.  The
major emphasis in schema design should, I believe, be on creating and
reusing schema chunks.  The purpose of the "other" (non-chunk) schemas
is to simply glue together the schema chunks.

As an aside, I am beginning to believe that there is too much emphasis
on glue schemas.  The glue elements give the "illusion" of importance,
when, in fact, they have no importance other than to act as a framework
to hold the "real" data.

I am starting to believe that the right approach may be to empower
instance document authors to decide what collection of schema chunks
they wish to use, and let them glue them together using whatever
elements they wish.

CONCLUSIONS

To enable interoperability, I think that creating broadly adopted,
reusable schema chunks is very important.  So, how do we design broadly
adopted, reusable schema chunks?  In this message I have attempted to
outline what I see as an approach to designing schema chunks.  It
fundamentally advocates a minimalist use of XML Schema functionality -
no type derivation, no element substitution, no import/include.

I welcome your comments and suggestions.  /Roger
Received on Friday, 27 December 2002 11:32:43 UTC