- From: Tim Bray <tbray@textuality.com>
- Date: Fri, 21 Jun 2002 17:31:51 -0700
- To: www-tag@w3.org
Recently, I proposed a TAG issue under the (unfortunate) name "PSVI Considered Harmful". This kicked off a lot of discussion on the www-tag mailing list, and the TAG has subsequently accepted the issue. This has led to further discussion over in the xml-schema-wg, and I think we've made enough progress to try to lay out what the architectural issues are, along with some proposed solutions.

My understanding of the issues has been helped immensely by thoughtful contributions from (among others) Noah Mendelsohn, Dave Ezell, and Mary Holstege, which I don't cite here because some of them are in member-only space; however, I do not claim that any of these contributors agree with this note.

1. XML Schema Validation generates information

Validation takes as input an XML instance and one or more XML Schema instances, and produces potentially a lot of output. This includes:

- whether the instance is valid
- whether each element and attribute is valid
- details about the validation process, e.g. this attribute is valid because it's a union type, one of the options is integer, and it qualifies as an integer
- schema types of elements and attributes
- elements and attributes that are defaulted, i.e. not actually present in the instance

Currently, all of this stuff is lumped together and placed in the "PSVI".

2. Use of PSVI contents

The XML Schema WG is currently investigating which pieces of the PSVI are of potential interest and assembling use cases. Presumably, if it emerges that there is wide interest in access to particular PSVI items, someone will have to take on the work of publishing an API and serialization for them.

3. The PSVI contents are heterogeneous

The only defining characteristic of the PSVI's contents is that they are generated as a result of schema validation. It's hard to think of any other meaningful shared characteristic.
The way we talk about types is different from the way we talk about validation outcomes, which is different again from the way we talk about the basic element-and-attribute additions made by defaulting.

4. Do the PSVI contents belong in the infoset?

Clearly the element and attribute items produced by defaulting are (logically) just like other elements and attributes, and the infoset is pre-cooked to accept them, so it seems like the infoset is a good place to put them. On the other hand, it's not obvious that the infoset's framework of "items" and "properties" is a good way to describe things like validation outcomes and type information. Let's assume we decide that some of this stuff needs to be made available to other parties - is it a useful or necessary step to go through the infoset to get there? I'm not being rhetorical here, this is just not obvious to me.

5. The PSVI type information is itself heterogeneous

This falls naturally out of the richness of the XML Schema type system. As someone (I think Noah) pointed out, it's easy to imagine sharing the semantics of built-in primitive types across a broad spectrum of specifications and applications (e.g. "this is an integer"). It's plausible but not as obvious to think about sharing restrictions of primitive types (e.g. "this is an integer greater than 3"). The notion of sharing complex and derived types starts to get pretty hairy pretty fast - anything that did this would have to have the semantics of XML Schema wired in pretty deeply. I'll be interested to see if there are use cases for sharing the semantics of complex types outside of the validation application.

6. Type naming is tricky

This falls naturally out of the previous point. XML Schema (correct me if I'm wrong) allows its types to be identified by qname. But the semantics that come with saying "this is an xsd:int" are obviously wildly different from those of some complex type that's been through several levels of derivation.
In particular, the former are widely shareable without knowledge of XML Schema semantics.

7. Type information is useful outside of validation applications

There is an existence proof for this: XML Query. Queries can make use both of type names for matching elements and attributes, and of particular type semantics (ordering and equality) for matching character data. It's not hard to imagine lots of other use cases.

8. Why not standardize on XML Schema's primitive data types?

XQuery (and I suspect many other facilities) are going to find it essential to hard-wire in the semantics of primitive types (numbers, dates, URIs). W3C has invested a huge amount of effort in building a primitive-type system as a part of XML Schema. I personally think it's too big and some gHorribleKludge types got in, but they're done and stable and I don't see any reason why they shouldn't serve as a basis for XQuery and anyone else who needs this kind of thing.

Question: are the specs well-enough modularized that it's easy to pull in the basic types by normative reference?

Proposal: let's issue a TAG finding saying that if you need primitive data types, use XML Schema's, don't invent your own.

Question: for things that are this widely shareable, I think it's architecturally essential to have actual URIs, not just qnames; is this hard to achieve?

9. Type names and type semantics exist independent of schemas

Let's consider an example: a system where large numbers of business transactions are encoded in XML, interchanged, and stored in a database, and need to be accessed by XQuery. In this particular case, schema validation is not done at run-time; all parties do application-specific validation and trust each other to encode numeric types and dates correctly.
Much of the markup is generated like this:

    fprintf(xmlStream, "<detail unitPrice='%.2f' quantity='%3d'/>", unitP, quant);

The XQuery processor that's accessing this database will know from some sort of data dictionary implementation that a <detail> element has unitPrice= and quantity= attributes, and the primitive data types of each attribute. While it uses primitive type names from XML Schema, it is possible in principle and plausible in practice that no schema has ever been written, let alone applied.

Note that in this case the type information is found neither in the schema (because there isn't one) nor in the instance. This doesn't in the slightest get in the way of, for example, XQuery semantics.

10. Coupling specs to PSVI as it exists today is architecturally unsound

The PSVI is a grab-bag of stuff that's defined as being the outcome of a particular operation; any attempt to pretend that all its contents can be talked about, addressed, or used in a uniform way is just misguided. Also, it needs to be crystal-clear that you can have types without having a schema or doing validation.

Proposal: let's issue a TAG finding that types ought to be addressable by name, and work with some WG to write architectural principles for naming them (qname or URI or both); getting this right is nontrivial, see item 6 above. XQuery seems like a good example of a sensible way to use these names.
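To make the data dictionary from item 9 concrete, here is a minimal sketch in C of what one might look like when no schema exists. Everything in it is an assumption for illustration: the struct attr_type layout, the lookup_attr_type function, and the choice of "decimal" and "integer" as the types of unitPrice and quantity. The note itself only says that the types are primitive ones named as in XML Schema.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical data-dictionary entry: records the primitive type of one
   attribute, using the local name of an XML Schema built-in type. */
struct attr_type {
    const char *element;    /* element name, e.g. "detail" */
    const char *attribute;  /* attribute name, e.g. "unitPrice" */
    const char *xsd_type;   /* assumed built-in type name, e.g. "decimal" */
};

/* Assumed contents: no schema document backs this table. */
static const struct attr_type dictionary[] = {
    { "detail", "unitPrice", "decimal" },
    { "detail", "quantity",  "integer" },
};

/* Return the type name recorded for element/@attribute,
   or NULL if the dictionary has no entry for that pair. */
const char *lookup_attr_type(const char *element, const char *attribute)
{
    size_t n = sizeof dictionary / sizeof dictionary[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(dictionary[i].element, element) == 0 &&
            strcmp(dictionary[i].attribute, attribute) == 0) {
            return dictionary[i].xsd_type;
        }
    }
    return NULL;
}
```

A query processor could call lookup_attr_type("detail", "quantity") to learn that integer ordering and equality apply to that attribute. The type name comes from the table alone, not from a schema or from the instance, which is the point of item 9.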
==========

Conclusion

Where I'd like to end up is:

- we have a list of well-known base types with well-known names that everybody uses consistently
- we have a generic naming system for types, including complex types
- we have a well-defined API and serialization for those parts of the output of the schema validation process that are used in non-validation applications
- we have multiple different schema facilities aimed at different kinds of applications, which however exhibit consistency in (a) their use of base primitive types, (b) the way they name types, and (c) the way they expose their output to the world.

This seems achievable and not all that ambitious.

-Tim
Received on Friday, 21 June 2002 20:31:50 UTC