RE: Potential new issue: PSVI considered harmful

I think there are many interesting issues on the table
in this discussion.  One that seems to be getting
attention today follows from Tim Bray's question:

> I can imagine doing type annotation in a much more
> lightweight way than bringing a large complex
> declarative schema facility to bear.  In fact, why
> shouldn't I just be able to jam something into the
> instance or infoset saying "this attribute here is an
> integer"?

I think we need to be careful about what sort of types
we mean, and where they come from.  We need to be clear
on the distinction between knowing the name of a type
vs. knowing the meaning of the type.  To motivate the
discussion, consider the following W3C XML Schema
features (not defending W3C schemas, just food for
thought):

  1. A repertoire of built-in "simple types"
  2. Named user derivations of those types (e.g. an
     "ageType" type might be a restriction of 
     Integer to 0-120)
  3. Complex types.  These capture the similarity
     between, and the common constraints on:

        <width units="cm">15</width>
                -and-
        <height units="inch">25</height>
  4. Locally scoped types and element declarations.

Question: in which cases, and to what extent, can we
envision doing what you suggest, i.e. providing some
typing information in self-describing XML documents for
which schema validation is not required?

  1. Not much problem.  If everyone agrees by fiat on a
     repertoire of built-in types, and their definitions
     are essentially considered to be known as part of
     the system infrastructure, then just saying
     <e xxx:type="yyy:Integer">123</e> makes element e
     self-describing.  (I've avoided specific namespaces
     to avoid unnecessary focus on one schema language
     or another.)
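To make case 1 concrete, here is a minimal sketch.  The
namespace URI is an invented placeholder (as are the xxx/yyy
prefixes above), and Python's xml.etree stands in for whatever
parser you happen to use; the point is only that the claimed
type name is readable straight from the instance, with no
schema in sight:

```python
# Sketch only: the namespace URI is an invented placeholder,
# as are the xxx/yyy prefixes from the example above.
import xml.etree.ElementTree as ET

XXX = "http://example.org/xxx"  # assumed URI bound to the xxx prefix

doc = ET.fromstring(
    f'<e xmlns:xxx="{XXX}" xxx:type="yyy:Integer">123</e>'
)

# No schema, no validation: the instance itself carries the type name.
type_name = doc.get("{%s}type" % XXX)
print(type_name)  # yyy:Integer
```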

  2. Named derivations of simple types: it's easy
     enough to carry type names around.  For them to
     mean anything, you probably need an agreed upon
     framework for doing the derivations, and some sort
     of common representation for the derivations.  So
     the need for a particular schema language starts
     creeping in.  Still, it's reasonable to do something
     like:

        <ageAtGraduation xxx:type="zzz:ageType">
            17
        </ageAtGraduation>

     The document is self-describing insofar as you
     know the name of the ageType without doing
     validation or referencing a schema.  Anytime you
     care what the type means (e.g. that you've got a
     subtype of Integer), you're likely to have a
     dependency on an external schema in some schema
     language.
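That name-vs.-meaning dependency can be sketched in code.
Here the meaning of zzz:ageType (a restriction of Integer to
0-120) is supplied out of band as a hand-written check, a
stand-in for whatever schema language actually defines it;
every URI and helper below is an invented illustration:

```python
# Sketch: the instance gives us only the type *name*; its *meaning*
# (Integer restricted to 0-120) must come from somewhere external,
# modeled here as a hand-written table.  All URIs are placeholders.
import xml.etree.ElementTree as ET

XXX = "http://example.org/xxx"

# Stand-in for external schema knowledge: type name -> value check.
known_types = {
    "zzz:ageType": lambda text: 0 <= int(text) <= 120,
}

doc = ET.fromstring(
    f'<ageAtGraduation xmlns:xxx="{XXX}" xxx:type="zzz:ageType">'
    "17</ageAtGraduation>"
)

claimed = doc.get("{%s}type" % XXX)     # knowable from the document alone
valid = known_types[claimed](doc.text)  # needs the external definition
print(claimed, valid)  # zzz:ageType True
```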

  3. Complex types.  (Yes, these are very useful,
     e.g. for mappings to databases and programming
     languages.)  Now we're into territory where the
     whole point is that different schema languages are
     likely to have different models of how contents
     are constrained.  Even the notion of such a type
     is only natural in some schema languages.
     Nonetheless, you can imagine the same sort of
     approach as with ages above.

     <width xxx:type="zzz:measurementType" 
             units="cm">
                 15
     </width>
     <height xxx:type="zzz:measurementType" 
             units="inch">
                 30
     </height>


     Without validation, you know the name of the type,
     and you know that both elements have the same
     type.  To understand what a measurementType is,
     you need some particular schema language, but
     not necessarily to do validation.  Only if you
     want to be sure the document told the truth about
     the type do you have to validate. 
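Even for complex types, the "same type name" observation can
be made by pure inspection.  A sketch (again with invented
placeholder URIs) that groups elements by claimed type name,
showing width and height are recognizably alike without
consulting any schema:

```python
# Sketch with placeholder URIs: grouping by claimed type name needs
# no schema language at all; interpreting measurementType would.
import xml.etree.ElementTree as ET

XXX = "http://example.org/xxx"

doc = ET.fromstring(f"""
<measurements xmlns:xxx="{XXX}">
  <width  xxx:type="zzz:measurementType" units="cm">15</width>
  <height xxx:type="zzz:measurementType" units="inch">30</height>
</measurements>
""")

by_type = {}
for el in doc:
    by_type.setdefault(el.get("{%s}type" % XXX), []).append(el.tag)

print(by_type)  # {'zzz:measurementType': ['width', 'height']}
```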
 
  4. Locally scoped definitions and other context
     sensitive constructions: I haven't thought this
     through, but one needs to take some care in
     building the constructions that allow one to have
     self-describing locally-scoped content.  I'd
     rather not burn a lot of energy on that case here.
     Surely it's something that languages like RELAX NG
     approach quite differently than W3C XML Schema in any
     case.  I don't like the way XML Schema does this, 
     but it's a Recommendation, and I would not change it
     now.  I believe that the architecturally correct
     way to have done local scoping would have been in
     XML 1.0 and/or Namespaces.  That way, you could
     indeed tell by inspecting a document whether the
     names of its elements were interpreted local to
     parent (as attributes are) or not.  I don't 
     propose to change that now either.

My point in listing the above cases is primarily to
clarify that knowing the name of the type to which an
element/attribute claims to conform is very different
from having useful knowledge of what the type means,
which in turn is one step short of doing the validation
to prove the document didn't lie about the type (which
in turn is different from relying on the schema to
assert the type name in the first place.)  In
considering Tim's challenge to build self-describing
documents, we need to consider the sort of types to be
supported, what kind of information various
applications will need, and where that information
should come from.

I like self-describing documents.  I agree that, if we
develop improved conventions for self-describing
documents, it would be worthwhile to separately group:
the parts of the Infoset that you can know by just
looking at the document; those for which you need a
particular schema language but not necessarily
validation (e.g. the definition of the type to which
you claim to conform); and those that result from doing
a validation (this element is valid...the reason this
attribute is valid is that it matched a union type and
it matched the "Integer" part of the union...etc.)  It
is also worth giving careful attention to those
abstractions that can reasonably be made to work across
schema languages vs. those which by their nature
depend on one schema language or another.  The names of
types stand out as an obvious possibility, and perhaps
the definitions of primitive types.  Derivations of
simple types seem less clear-cut.

Having defended self-describing documents, I think it's
worth considering the other side of the coin too...

Since a primary purpose of schemas is to be shared
across documents, there are important cases in which
it's not appropriate to put type definitions in the
document.  I really do want to know in advance what the
100,000 purchase orders I get next year are going to
look like.  I want to know which fields will be
subtypes of integers, and I want to know which ones are
measurements.  I want to build databases to store them
and UIs to edit them.  When the purchase orders come
in, it is very useful to know which parts of the
document map to which of the constructions I was
planning for, so for those purposes the PSVI is a good
thing.  That doesn't mean we shouldn't encourage use of
self-describing documents wherever practical, just that
PSVI's are important too.

There's only so far you can go in building my purchase
order system without choosing one schema
language or another.  Thus, while the web itself or XML
per se should not depend on any one language, it's not
surprising that a system like XML Query, or my database
XML integration layer, can do better with knowledge of
some particular type system.  I have no objection to
minimizing unnecessary dependencies on particular
schema languages in our other W3C Recommendations, but
we should think hard before ruling out the dependencies
that bring real value.  I suspect XML Query will be one
such case, and maybe XPath as well.

Thus, I do think it is a very good thing that a
language such as W3C XML schema takes the trouble to
carefully describe and formalize the information that's
known after a validation.  Yes, a different language
might provide less information or more as a result of
validation (e.g. might tell you only about validity,
not defaults or types), but that's a nearly orthogonal
issue.  What you do know, you should formalize, IMO.
The proposal to rule out PSVI's could be taken as
discouraging such formalization;  I suspect that's not
quite what was intended.

Some of what you know from validation will necessarily
be schema-language specific.  Thus, it's likely that
there will be a somewhat different PSVI for each schema
language.  Layering those things that might also be
known before validation (e.g. type names) or factoring
those likely to be common across many schema languages
(primitive type names?) seems like a good thing, but
not in conflict with the need to have a language-specific
PSVI for the rest.

By the way: SOAP Encoding provides a particular style
of self-describing XML without validation or dependence
on a particular schema language.  See [1, 2].  A
non-normative Appendix discusses the options available
to applications that do wish to do W3C XML Schema
validation of SOAP messages [3].  (All links are to an
editor's snapshot, which is stable in URI space; the
pertinent sections were added after the last WD was
published.)

I am avoiding the temptation for now to go into the
many other important issues raised by Tim's proposal.
This one seems to be the subject of many of today's
notes, so I've focused on that.

With respect to W3C process:  I do hope the TAG will
proceed with some care in situations where a proposed 
architectural principle (e.g. no defaults, no PSVI) 
conflicts with established Recommendations (in this 
case XML 1.0 and Schemas).

Thank you very much for your patience with this long
note.

[1] http://www.w3.org/2000/xp/Group/2/06/06/soap12-part2.html#soapenc
[2] http://www.w3.org/2000/xp/Group/2/06/06/soap12-part2.html#enctypename
[3] http://www.w3.org/2000/xp/Group/2/06/06/soap12-part2.html#encschema


------------------------------------------------------------------
Noah Mendelsohn                              Voice: 1-617-693-4036
IBM Corporation                                Fax: 1-617-693-8676
One Rogers Street
Cambridge, MA 02142
------------------------------------------------------------------

Received on Thursday, 13 June 2002 17:18:02 UTC