- From: <noah_mendelsohn@us.ibm.com>
- Date: Thu, 13 Jun 2002 16:59:34 -0400
- To: "Tim Bray" <tbray@textuality.com>, www-tag@w3.org
I think there are many interesting issues on the table in this discussion. One that seems to be getting attention today follows from Tim Bray's question:

> I can imagine doing type annotation in a much more
> lightweight way than bringing a large complex
> declarative schema facility to bear. In fact, why
> shouldn't I just be able to jam something into the
> instance or infoset saying "this attribute here is an
> integer"?

I think we need to be careful about what sort of types we mean, and where they come from. We need to be clear on the distinction between knowing the name of a type vs. knowing the meaning of the type. To motivate the discussion, consider the following W3C XML Schema features (not defending W3C schemas, just food for thought):

1. A repertoire of built-in "simple types".

2. Named user derivations of those types (e.g. an "ageType" type might be a restriction of Integer to 0-120).

3. Complex types. These capture the similarity between and common constraints on:

     <width units="cm">15</width>

   -and-

     <height units="inch">25</height>

4. Locally scoped types and element declarations.

Question: in which cases and to what extent can we envision doing what you suggest, i.e. providing some typing information in self-describing XML documents, for which schema validation is not required?

1. Not much problem. If everyone agrees by fiat on a repertoire of built-in types, and their definitions are essentially considered to be known as part of the system infrastructure, then just saying <e xxx:type="yyy:Integer">123</e> makes element e self-describing. (I've avoided specific namespaces to avoid unnecessary focus on one schema language or another.)

2. Named derivations of simple types: it's easy enough to carry type names around. For them to mean anything, you probably need an agreed-upon framework for doing the derivations, and some sort of common representation for the derivations. So the need for a particular schema language starts creeping in.
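To make case 1 concrete, here is a minimal sketch of what a consumer can do with a built-in type name alone, using only Python's standard library. The namespace URIs standing in for the xxx: and yyy: prefixes are invented placeholders, not real ones:

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs standing in for the xxx: and yyy: prefixes.
TYPE_NS = "http://example.org/type-annotation"   # xxx: placeholder
BUILTIN_NS = "http://example.org/builtin-types"  # yyy: placeholder

doc = (f'<e xmlns:xxx="{TYPE_NS}" xmlns:yyy="{BUILTIN_NS}" '
       f'xxx:type="yyy:Integer">123</e>')
e = ET.fromstring(doc)

# With an agreed repertoire of built-in types, no schema is needed:
# the type name alone tells the consumer how to interpret the content.
claimed = e.get(f"{{{TYPE_NS}}}type")
value = int(e.text) if claimed and claimed.endswith(":Integer") else e.text
```

The point is that `claimed` and `value` are recovered by inspection of the instance alone, with no schema fetched and no validation performed.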
Still, it's reasonable to do something like:

    <ageAtGraduation xxx:type="zzz:ageType">17</ageAtGraduation>

The document is self-describing insofar as you know the name of the ageType without doing validation or referencing a schema. Any time you care what the type means (e.g. that you've got a subtype of Integer), you're likely to have a dependency on an external schema in some schema language.

3. Complex types. (Yes, these are very useful, e.g. for mappings to databases and programming languages.) Now we're into territory where the whole point is that different schema languages are likely to have different models of how contents are constrained. Even the notion of such a type is only natural in some schema languages. Nonetheless, you can imagine the same sort of approach as with ages above:

    <width xxx:type="zzz:measurementType" units="cm">15</width>
    <height xxx:type="zzz:measurementType" units="inch">30</height>

Without validation, you know the name of the type, and you know that both elements have the same type. To understand what a measurementType is, you need some particular schema language, but not necessarily to do validation. Only if you want to be sure the document told the truth about the type do you have to validate.

4. Locally scoped definitions and other context-sensitive constructions: I haven't thought this through, but one needs to take some care in building the constructions that allow one to have self-describing locally-scoped content. I'd rather not burn a lot of energy on that case here. Surely it's something that languages like RELAX NG approach quite differently than W3C Schema in any case. I don't like the way XML Schema does this, but it's a Recommendation, and I would not change it now. I believe that the architecturally correct way to have done local scoping would have been in XML 1.0 and/or Namespaces.
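The complex-type case can be sketched the same way: without validation, a consumer can still observe that width and height claim the same type name and treat them uniformly. A stdlib-only Python sketch; the namespace URI is an invented placeholder, and the zzz: prefix is carried opaquely as part of the attribute value:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TYPE_NS = "http://example.org/type-annotation"  # xxx: placeholder

doc = f'''<dimensions xmlns:xxx="{TYPE_NS}">
  <width xxx:type="zzz:measurementType" units="cm">15</width>
  <height xxx:type="zzz:measurementType" units="inch">30</height>
</dimensions>'''

# Group elements by the type name they claim; no schema, no validation.
by_type = defaultdict(list)
for el in ET.fromstring(doc):
    by_type[el.get(f"{{{TYPE_NS}}}type")].append(el.tag)
```

Both elements land under the same key, "zzz:measurementType" — which is exactly what you know by inspection, and all you know: what a measurementType *is* still lives in some schema language.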
That way, you could indeed tell by inspecting a document whether the names of its elements were interpreted local to their parent (as attributes are) or not. I don't propose to change that now either.

My point in listing the above cases is primarily to clarify that knowing the name of the type to which an element/attribute claims to conform is very different from having useful knowledge of what the type means, which in turn is one step short of doing the validation to prove the document didn't lie about the type (which in turn is different from relying on the schema to assert the type name in the first place).

In considering Tim's challenge to build self-describing documents, we need to consider the sorts of types to be supported, what kind of information various applications will need, and where that information should come from. I like self-describing documents. I agree that, if we develop improved conventions for self-describing documents, it would be worthwhile to separately group: the parts of the Infoset that you can know by just looking at the document; those for which you need a particular schema language but not necessarily validation (e.g. the definition of the type to which you claim to conform); and those that result from doing a validation (this element is valid... the reason this attribute is valid is that it matched a union type and it matched the "Integer" part of the union... etc.).

It is also worth giving careful attention to those abstractions that can reasonably be made to work across schema languages vs. those which by their nature depend on one schema language or another. The names of types stand out as an obvious possibility, and perhaps the definitions of primitive types. Derivations of simple types seem less clear-cut.

Having defended self-describing documents, I think it's worth considering the other side of the coin too...
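The gap between knowing a type name and proving conformance can be illustrated with the ageType example above. A consumer that knows the type's definition (here hand-coded as a stand-in for a real schema-language restriction of Integer to 0-120) can check whether the document told the truth; the namespace URI and the checker function are hypothetical:

```python
import xml.etree.ElementTree as ET

TYPE_NS = "http://example.org/type-annotation"  # xxx: placeholder

def conforms_to_age_type(text):
    """Hand-coded stand-in for validating against zzz:ageType,
    the example restriction of Integer to the range 0-120."""
    try:
        n = int(text)
    except (TypeError, ValueError):
        return False
    return 0 <= n <= 120

doc = (f'<ageAtGraduation xmlns:xxx="{TYPE_NS}" '
       f'xxx:type="zzz:ageType">17</ageAtGraduation>')
el = ET.fromstring(doc)

claimed = el.get(f"{{{TYPE_NS}}}type")    # knowable by inspection alone
truthful = conforms_to_age_type(el.text)  # needs the type's definition
```

The first line of the check corresponds to the "just looking at the document" tier; the second needs the definition from some schema language, which is the validation tier.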
Since a primary purpose of schemas is to be shared across documents, there are important cases in which it's not appropriate to put type definitions in the document. I really do want to know in advance what the 100,000 purchase orders I get next year are going to look like. I want to know which fields will be subtypes of integers, and I want to know which ones are measurements. I want to build databases to store them and UIs to edit them. When the purchase orders come in, it is very useful to know which parts of the document map to which of the constructions I was planning for, so for those purposes the PSVI is a good thing. That doesn't mean we shouldn't encourage use of self-describing documents wherever practical, just that PSVIs are important too.

There's only so far you can go in building my purchase order system without choosing one schema language or another. Thus, while the web itself or XML per se should not depend on any one language, it's not surprising that a system like XML Query, or my database/XML integration layer, can do better with knowledge of some particular type system. I have no objection to minimizing unnecessary dependencies on particular schema languages in our other W3C Recommendations, but we should think hard before ruling out the dependencies that bring real value. I suspect XML Query will be one such case, and maybe XPath as well.

Thus, I do think it is a very good thing that a language such as W3C XML Schema takes the trouble to carefully describe and formalize the information that's known after a validation. Yes, a different language might provide less information or more as a result of validation (e.g. might tell you only about validity, not defaults or types), but that's a nearly orthogonal issue. What you do know, you should formalize, IMO. The proposal to rule out PSVIs could be taken as discouraging such formalization; I suspect that's not quite what was intended.
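The purchase-order point is essentially about planning: knowing field types in advance lets the receiver build typed storage once, then route each incoming value to the construction planned for it. A minimal sketch under invented field and type names (the mapping below stands in for what a shared schema tells you up front):

```python
# Invented field->type plan, standing in for what a shared purchase-order
# schema tells the receiver in advance of any document arriving.
PLANNED_COLUMNS = {
    "quantity": int,     # a subtype of integer in the schema
    "unitPrice": float,  # a measurement-like decimal
    "sku": str,
}

def load_row(fields):
    """Map incoming text fields to the typed columns planned from the schema."""
    return {name: PLANNED_COLUMNS[name](text) for name, text in fields.items()}

row = load_row({"quantity": "3", "unitPrice": "19.95", "sku": "AB-17"})
# row == {'quantity': 3, 'unitPrice': 19.95, 'sku': 'AB-17'}
```

This only works because the type plan was fixed before the documents arrived — which is the case the PSVI serves, and the case pure self-description does not cover.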
Some of what you know from validation will necessarily be schema-language specific. Thus, it's likely that there will be a somewhat different PSVI for each schema language. Layering those things that might also be known before validation (e.g. type names), or factoring out those likely to be common across many schema languages (primitive type names?), seems like a good thing, but is not in conflict with the need to have a language-specific PSVI for the rest.

By the way: SOAP Encoding provides a particular style of self-describing XML without validation or dependence on a particular schema language. See [1, 2]. A non-normative appendix discusses the options available to applications that do wish to do W3C XML Schema validation of SOAP messages [3]. (All links are to an editors' snapshot, which is stable in URI space; the pertinent sections were added after the last WD was published.)

I am avoiding the temptation for now to go into the many other important issues raised by Tim's proposal. This one seems to be the subject of many of today's notes, so I've focused on that.

With respect to W3C process: I do hope the TAG will proceed with some care in situations where a proposed architectural principle (e.g. no defaults, no PSVI) conflicts with established Recommendations (in this case XML 1.0 and Schemas).

Thank you very much for your patience with this long note.

[1] http://www.w3.org/2000/xp/Group/2/06/06/soap12-part2.html#soapenc
[2] http://www.w3.org/2000/xp/Group/2/06/06/soap12-part2.html#enctypename
[3] http://www.w3.org/2000/xp/Group/2/06/06/soap12-part2.html#encschema

------------------------------------------------------------------
Noah Mendelsohn                              Voice: 1-617-693-4036
IBM Corporation                                Fax: 1-617-693-8676
One Rogers Street
Cambridge, MA 02142
------------------------------------------------------------------
Received on Thursday, 13 June 2002 17:18:02 UTC