- From: <noah_mendelsohn@us.ibm.com>
- Date: Thu, 13 Jun 2002 16:59:34 -0400
- To: "Tim Bray" <tbray@textuality.com>, www-tag@w3.org
I think there are many interesting issues on the table
in this discussion. One that seems to be getting
attention today follows from Tim Bray's question:
> I can imagine doing type annotation in a much more
> lightweight way than bringing a large complex
> declarative schema facility to bear. In fact, why
> shouldn't I just be able to jam something into the
> instance or infoset saying "this attribute here is an
> integer"?
I think we need to be careful about what sort of types
we mean, and where they come from. We need to be clear
on the distinction between knowing the name of a type
vs. knowing the meaning of the type. To motivate the
discussion, consider the following W3C XML Schema
features (not defending W3C schemas, just food for
thought):
1. A repertoire of built-in "simple types"
2. Named user derivations of those types (e.g. an
"ageType" type might be a restriction of
Integer to 0-120)
3. Complex types. These capture the similarity
between and common constraints on:
<width units="cm">15</width>
-and-
<height units="inch">25</height>
4. Locally scoped types and element declarations.
Question: in which cases and to what extent can we envision
doing what you suggest, i.e. providing some typing information
in self-describing XML documents for which schema validation
is not required?
1. Not much problem. If everyone agrees by fiat on a
repertoire of built-in types, and their definitions
are essentially considered to be known as part of
the system infrastructure, then just saying
<e xxx:type="yyy:Integer">123</e> makes element e
self-describing (I've avoided specific namespaces
to avoid unnecessary focus on one schema language
or another.)
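As a sketch of that "agreed by fiat" case (in Python; the namespace URIs below are placeholders of my own, not drawn from any real specification), a consumer can map the built-in type name straight to a value parser with no schema in sight:

```python
# Sketch of case 1: a built-in type annotation, readable without any
# schema.  The namespace URIs are hypothetical placeholders.
import xml.etree.ElementTree as ET

XXX = "http://example.org/type-annotation"  # hypothetical annotation namespace

doc = """<e xmlns:xxx="http://example.org/type-annotation"
            xmlns:yyy="http://example.org/builtin-types"
            xxx:type="yyy:Integer">123</e>"""

e = ET.fromstring(doc)
type_name = e.get("{%s}type" % XXX)  # the name alone: "yyy:Integer"

# Because the repertoire of built-in types is agreed on by fiat, the
# consumer can map the name straight to a parser for the value:
BUILTINS = {"yyy:Integer": int, "yyy:String": str}
value = BUILTINS[type_name](e.text)
print(type_name, value)  # yyy:Integer 123
```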
2. Named derivations of simple types: it's easy
enough to carry type names around. For them to
mean anything, you probably need an agreed upon
framework for doing the derivations, and some sort
of common representation for the derivations. So
the need for a particular schema language starts
creeping in. Still, it's reasonable to do something
like:
<ageAtGraduation xxx:type="zzz:ageType">
17
</ageAtGraduation>
The document is self-describing insofar as you
know the name of the ageType without doing
validation or referencing a schema. Anytime you
care what the type means (e.g. that you've got a
subtype of Integer), you're likely to have a
dependency on an external schema in some schema
language.
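To make the name/meaning split concrete (again with placeholder namespace URIs, and a hand-rolled stand-in for an agreed derivation framework -- neither comes from any real specification):

```python
# Sketch of case 2: a named derivation of a simple type.  The document
# alone yields the type *name*; checking the *meaning* ("restriction
# of Integer to 0-120") needs a shared representation of the
# derivation.  The namespace URI and DERIVATIONS table are stand-ins.
import xml.etree.ElementTree as ET

doc = """<ageAtGraduation xmlns:xxx="http://example.org/type-annotation"
            xxx:type="zzz:ageType">17</ageAtGraduation>"""

el = ET.fromstring(doc)
name = el.get("{http://example.org/type-annotation}type")
print(name)  # zzz:ageType -- knowable by inspection, no schema consulted

# The meaning lives outside the document.  A minimal stand-in for the
# agreed-upon derivation: (base type, minimum, maximum).
DERIVATIONS = {"zzz:ageType": ("Integer", 0, 120)}
base, lo, hi = DERIVATIONS[name]
v = int(el.text.strip())
assert lo <= v <= hi  # only checkable once the derivation is shared
```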
3. Complex types. (Yes, these are very useful
e.g. for mappings to databases and programming
languages.) Now we're into territory where the
whole point is that different schema languages are
likely to have different models of how contents
are constrained. Even the notion of such a type
is only natural in some schema languages.
Nonetheless, you can imagine the same sort of
approach as with ages above.
<width xxx:type="zzz:measurementType"
units="cm">
15
</width>
<height xxx:type="zzz:measurementType"
units="inch">
30
</height>
Without validation, you know the name of the type,
and you know that both elements have the same
type. To understand what a measurementType is,
you need some particular schema language, but
not necessarily to do validation. Only if you
want to be sure the document told the truth about
the type do you have to validate.
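A small sketch of that situation (namespace URI again a placeholder): without consulting any schema, a consumer can still see that the two elements claim the same type name.

```python
# Sketch of case 3: complex types.  Without validation you still learn
# the type names, and hence that <width> and <height> claim the *same*
# type -- even though what a measurementType *is* depends on some
# particular schema language.
import xml.etree.ElementTree as ET

doc = """<dims xmlns:xxx="http://example.org/type-annotation">
  <width  xxx:type="zzz:measurementType" units="cm">15</width>
  <height xxx:type="zzz:measurementType" units="inch">30</height>
</dims>"""

TYPE = "{http://example.org/type-annotation}type"
root = ET.fromstring(doc)
names = {child.tag: child.get(TYPE) for child in root}
print(names)  # both elements carry the same type name
```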
4. Locally scoped definitions and other context
sensitive constructions: I haven't thought this
through, but one needs to take some care in
building the constructions that allow one to have
self-describing locally-scoped content. I'd
rather not burn a lot of energy on that case here.
Surely it's something that languages like RELAX NG
approach quite differently than W3C schema in any
case. I don't like the way XML Schema does this,
but it's a Recommendation, and I would not change it
now. I believe that the architecturally correct
way to have done local scoping would have been in
XML 1.0 and/or Namespaces. That way, you could
indeed tell by inspecting a document whether the
names of its elements were interpreted local to
parent (as attributes are) or not. I don't
propose to change that now either.
My point in listing the above cases is primarily to
clarify that knowing the name of the type to which an
element/attribute claims to conform is very different
from having useful knowledge of what the type means,
which in turn is one step short of doing the validation
to prove the document didn't lie about the type (which
in turn is different from relying on the schema to
assert the type name in the first place.) In
considering Tim's challenge to build self-describing
documents, we need to consider the sort of types to be
supported, what kind of information various
applications will need, and where that information
should come from.
I like self-describing documents. I agree that, if we
develop improved conventions for self-describing
documents, it would be worthwhile to separately group:
the parts of the Infoset that you can know by just
looking at the document; those for which you need a
particular schema language but not necessarily
validation (e.g. the definition of the type to which
you claim to conform); and those that result from doing
a validation (this element is valid...the reason this
attribute is valid is that it matched a union type and
it matched the "Integer" part of the union...etc.) It
is also worth giving careful attention to those
abstractions that can reasonably be made to work across
schema languages vs. those which by their nature
depend on one schema language or another. The names of
types stand out as an obvious possibility, and perhaps
the definitions of primitive types. Derivations of
simple types seem less clear-cut.
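As a toy illustration of that three-way grouping (the property names and their bucket assignments below are my own illustrative choices, not drawn from any Infoset or PSVI specification):

```python
# Toy classifier for the three-way grouping suggested above.
GROUPS = {
    "element name":         "document-only",    # knowable by inspection
    "type name":            "document-only",    # e.g. an xxx:type attribute
    "type definition":      "schema-language",  # needs the language, not validation
    "validity":             "post-validation",  # result of validating
    "union member matched": "post-validation",  # which branch of a union matched
}

by_group = {}
for prop, group in GROUPS.items():
    by_group.setdefault(group, []).append(prop)
print(by_group)
```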
Having defended self-describing documents, I think it's
worth considering the other side of the coin too...
Since a primary purpose of schemas is to be shared
across documents, there are important cases in which
it's not appropriate to put type definitions in the
document. I really do want to know in advance what the
100,000 purchase orders I get next year are going to
look like. I want to know which fields will be
subtypes of integers, and I want to know which ones are
measurements. I want to build databases to store them
and UIs to edit them. When the purchase orders come
in, it is very useful to know which parts of the
document map to which of the constructions I was
planning for, so for those purposes the PSVI is a good
thing. That doesn't mean we shouldn't encourage use of
self-describing documents wherever practical, just that
PSVIs are important too.
There's only so far you can go in building my purchase
order systems without choosing one schema
language or another. Thus, while the web itself or XML
per se should not depend on any one language, it's not
surprising that a system like XML Query, or my database
XML integration layer, can do better with knowledge of
some particular type system. I have no objection to
minimizing unnecessary dependencies on particular
schema languages in our other W3C Recommendations, but
we should think hard before ruling out the dependencies
that bring real value. I suspect XML Query will be one
such case, and maybe XPath as well.
Thus, I do think it is a very good thing that a
language such as W3C XML schema takes the trouble to
carefully describe and formalize the information that's
known after a validation. Yes, a different language
might provide less information or more as a result of
validation (e.g. might tell you only about validity,
not defaults or types), but that's a nearly orthogonal
issue. What you do know, you should formalize, IMO.
The proposal to rule out PSVIs could be taken as
discouraging such formalization; I suspect that's not
quite what was intended.
Some of what you know from validation will necessarily
be schema-language specific. Thus, it's likely that
there will be a somewhat different PSVI for each schema
language. Layering those things that might also be
known before validation (e.g. type names) or factoring
those likely to be common across many schema languages
(primitive type names?) seems like a good thing, but
not in conflict with the need to have a language-specific
PSVI for the rest.
By the way: SOAP Encoding provides a particular style
of self-describing XML without validation or dependence
on a particular schema language. See [1] and [2]. A
non-normative Appendix discusses the options available
to applications that do wish to do W3C XML Schema
validation of SOAP messages. [3] (all links are to an
editor's snapshot, which is stable in URI space--the
pertinent sections were added after the last WD was
published.)
I am avoiding the temptation for now to go into the
many other important issues raised by Tim's proposal.
This one seems to be the subject of many of today's
notes, so I've focussed on that.
With respect to W3C process: I do hope the TAG will
proceed with some care in situations where a proposed
architectural principle (e.g. no defaults, no PSVI)
conflicts with established Recommendations (in this
case XML 1.0 and Schemas).
Thank you very much for your patience with this long
note.
[1] http://www.w3.org/2000/xp/Group/2/06/06/soap12-part2.html#soapenc
[2] http://www.w3.org/2000/xp/Group/2/06/06/soap12-part2.html#enctypename
[3] http://www.w3.org/2000/xp/Group/2/06/06/soap12-part2.html#encschema
------------------------------------------------------------------
Noah Mendelsohn Voice: 1-617-693-4036
IBM Corporation Fax: 1-617-693-8676
One Rogers Street
Cambridge, MA 02142
------------------------------------------------------------------
Received on Thursday, 13 June 2002 17:18:02 UTC