PSVI architectural discussion

Recently, I proposed a TAG issue under the (unfortunate) name "PSVI 
Considered Harmful".  This kicked off a lot of discussion on the www-tag 
mailing list, and the TAG has subsequently accepted the issue.  This 
has led to further discussion over in the xml-schema-wg, and I think 
we've made enough progress to try to lay out what the architectural 
issues are, along with some proposed solutions.  My understanding of the 
issues has been helped immensely by thoughtful contributions from (among 
others) Noah Mendelsohn, Dave Ezell, and Mary Holstege, which I don't cite 
here because some of them are in member-only space; however, I do not 
claim that any of these contributors agree with this note.

1. XML Schema Validation generates information

Validation takes as input an XML instance and one or more XML Schema
instances, and potentially produces a lot of output.  This includes:

- whether the instance is valid
- whether each element and attribute is valid
- details about the validation process, e.g. this attribute is valid 
because its type is a union, one of the member types is integer, and 
the value qualifies as an integer
- schema types of elements and attributes
- elements and attributes that are defaulted, i.e. not actually present 
in the instance

Currently, all of this stuff is lumped together and placed in the "PSVI".
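
To make the lumping concrete, here is a rough C sketch that keeps three
of the kinds of output listed above in separate structures instead of
one bag.  Every name in it (ValidationOutcome, TypeAnnotation,
DefaultedAttribute, count_invalid) is invented for illustration; none of
it comes from the XML Schema specs or from any real API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: the pass/fail verdict for one element or attribute. */
typedef struct {
    const char *node_name;
    bool        valid;
} ValidationOutcome;

/* Hypothetical: the schema type assigned to one element or attribute. */
typedef struct {
    const char *node_name;
    const char *type_name;      /* e.g. "xs:integer" */
} TypeAnnotation;

/* Hypothetical: an attribute supplied by defaulting, i.e. one that is
 * not actually present in the instance. */
typedef struct {
    const char *owner_element;
    const char *attr_name;
    const char *default_value;
} DefaultedAttribute;

/* Count the failed outcomes.  A consumer that cares only about validity
 * needs the outcome records and nothing else. */
size_t count_invalid(const ValidationOutcome *outcomes, size_t n)
{
    assert(outcomes != NULL || n == 0);
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        if (!outcomes[i].valid) {
            bad++;
        }
    }
    return bad;
}
```

The three records share no fields beyond a name, which is one way of
seeing how little the items have in common.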

2. Use of PSVI contents

The XML Schema WG is currently engaged in investigating which pieces of 
the PSVI are of potential interest and assembling use cases.  Presumably 
if it emerges that there is wide interest in access to particular PSVI 
items, someone will have to take on the work of publishing an API and 
serialization for them.

3. The PSVI contents are heterogeneous

The PSVI's contents have the sole defining characteristic that they are 
generated as a result of schema validation.  It's hard to think of any 
other meaningful shared characteristic.  The way we talk about types is 
different from the way we talk about validation outcomes, which in turn 
is different from the way we talk about the elements and attributes 
added by defaulting.

4. Do the PSVI contents belong in the infoset?

Clearly the element and attribute items produced by defaulting are 
(logically) just like other elements and attributes, and the infoset is 
pre-cooked to accept them, so it seems like the infoset is a good place 
to put them.

On the other hand, it's not obvious that the infoset's framework of 
"items" and "properties" is a good way to describe things like 
validation outcomes and type information.  Let's assume we decide that 
some of this stuff needs to be made available to other parties - is it a 
useful or necessary step to go through the infoset to get there?  I'm 
not being rhetorical here; this is just not obvious to me.

5. The PSVI type information is itself heterogeneous

This falls naturally out of the richness of the XML Schema type system.
As someone (I think Noah) pointed out, it's easy to imagine sharing 
the semantics of built-in primitive types across a broad spectrum of 
specifications and applications (e.g., "this is an integer").  It's 
plausible but not as obvious to think about sharing restrictions of 
primitive types (e.g. "this is an integer greater than 3").  The notion 
of sharing complex and derived types starts to get pretty hairy pretty 
fast - anything that did this would have to have the semantics of XML 
Schema wired in pretty deeply.
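
The first two rungs of that ladder can be sketched in a few lines of C.
This is only an approximation under stated assumptions: the function
names are made up, strtol() stands in for the real lexical rules of the
schema integer type (it only handles values that fit in a long), and
the "greater than 3" bound is passed in as a parameter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* "This is an integer": true if text parses completely as a base-10
 * integer that fits in a long.  Hypothetical helper. */
bool is_integer(const char *text, long *value)
{
    assert(text != NULL && value != NULL);
    char *end;
    errno = 0;
    long v = strtol(text, &end, 10);
    if (end == text || *end != '\0' || errno == ERANGE) {
        return false;
    }
    *value = v;
    return true;
}

/* "This is an integer greater than N": the primitive check plus one
 * extra comparison.  Hypothetical helper. */
bool is_integer_greater_than(const char *text, long min_exclusive)
{
    long v;
    return is_integer(text, &v) && v > min_exclusive;
}
```

Both checks fit in a screenful without any schema machinery; nothing
comparably small would cover a complex type with several levels of
derivation.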

I'll be interested to see if there are use cases for sharing the 
semantics of complex types outside of the validation application.

6. Type naming is tricky

This falls naturally out of the previous point.  XML Schema (correct me 
if I'm wrong) allows its types to be identified by qname.  But the 
semantics that come with saying "this is an xs:int" are obviously 
wildly different from those of some complex type that's been through 
several levels of derivation.  In particular, the former is widely 
shareable without knowledge of XML Schema semantics.

7. Type information is useful outside of validation applications

There is an existence proof for this: XML Query.  Queries can make use 
both of type names for matching elements and attributes, and of 
particular type semantics (ordering and equality) for matching character 
data.  It's not hard to imagine lots of other use-cases.
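
A small C sketch of why the type semantics matter for ordering and
equality: the same two attribute values compare differently as strings
and as integers.  The function names are invented, and the integer
version assumes well-formed input that fits in a long; it does no error
checking.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Compare two values as untyped character data.
 * Returns <0, 0 or >0, like strcmp(). */
int compare_as_string(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b);
}

/* Compare the same two values as integers.  Assumes both are
 * well-formed base-10 integers that fit in a long.
 * Returns -1, 0 or 1. */
int compare_as_integer(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    long x = strtol(a, NULL, 10);
    long y = strtol(b, NULL, 10);
    return (x > y) - (x < y);
}
```

As strings, "10" sorts before "9" and "007" differs from "7"; as
integers, 10 is greater than 9 and 007 equals 7.  A query that matches
on character data has to know which comparison applies.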

8. Why not standardize on XML Schema's primitive data types?

XQuery (and, I suspect, many other facilities) is going to find it 
essential to hard-wire in the semantics of primitive types (numbers, 
dates, URIs).  W3C has invested a huge amount of effort in building a 
primitive-type system as a part of XML Schema.  I personally think it's 
too big and some gHorribleKludge types got in, but they're done and 
stable and I don't see any reason why they shouldn't serve as a basis 
for XQuery and anyone else who needs this kind of thing.

Question: are the specs modularized well enough that it's easy to 
normatively reference just the basic types?

Proposal: let's issue a TAG finding saying that if you need primitive 
data types, use XML Schema's, don't invent your own.

Question: For things that are this widely shareable, I think it's 
architecturally essential to have actual URIs, not just qnames; is this 
hard to achieve?
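
For the built-in types, one candidate convention already exists: XML
Schema Part 2 describes addressing each built-in datatype with the
schema namespace URI plus the type name as a fragment identifier.  The C
sketch below simply builds such a string.  The helper name type_uri is
made up, and applying the same pattern to user-defined types is an
assumption, not something the spec settles.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The XML Schema namespace URI. */
#define XSD_NS "http://www.w3.org/2001/XMLSchema"

/* Hypothetical helper: write "<namespace>#<local>" into out.
 * Returns 0 on success, -1 if the result does not fit in out_size. */
int type_uri(const char *ns, const char *local, char *out, size_t out_size)
{
    assert(ns != NULL && local != NULL && out != NULL);
    /* namespace + '#' + local name + terminating NUL */
    size_t needed = strlen(ns) + 1 + strlen(local) + 1;
    if (needed > out_size) {
        return -1;
    }
    snprintf(out, out_size, "%s#%s", ns, local);
    return 0;
}
```

Called with XSD_NS and "integer", this yields
http://www.w3.org/2001/XMLSchema#integer, a name that can be compared
and passed around without resolving any prefix.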

9. Type names and type semantics exist independent of schemas

Let's consider an example: a system where large numbers of business 
transactions are encoded in XML, interchanged, stored in a database, 
and accessed by XQuery.  In this particular case, schema validation is 
not done at run-time; all parties do application-specific validation 
and trust each other to encode numeric types and dates correctly.  Much 
of the markup is generated like this:

  fprintf(xmlStream, "<detail unitPrice='%.2f' quantity='%3d'/>",
          unitP, quant);

The XQuery processor that's accessing this database will know from some 
sort of data dictionary implementation that a <detail> element has 
unitPrice= and quantity= attributes, and the primitive data types of 
each attribute.  While it uses primitive type names from XML Schema, it 
is possible in principle and plausible in practice that no schema has 
ever been written, let alone applied.

Note that in this case the type information is found neither in the 
schema (because there isn't one) nor in the instance.  This doesn't in 
the slightest get in the way of, for example, XQuery semantics.
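
A minimal C sketch of what such a data dictionary might look like, to
show that the type names can live entirely outside any schema or
instance.  The structure, the table, and the lookup_type function are
all hypothetical, and the choice of xs:decimal and xs:integer for the
two attributes in the fprintf example is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical data-dictionary entry: one attribute of one element,
 * with the name of its primitive type. */
typedef struct {
    const char *element;
    const char *attribute;
    const char *type_name;
} DictEntry;

/* Entries for the <detail> element.  The type choices are assumptions;
 * no schema is consulted anywhere. */
static const DictEntry DICTIONARY[] = {
    { "detail", "unitPrice", "xs:decimal" },
    { "detail", "quantity",  "xs:integer" },
};

/* Return the type name for element/@attribute, or NULL if the
 * dictionary has no entry for it. */
const char *lookup_type(const char *element, const char *attribute)
{
    assert(element != NULL && attribute != NULL);
    size_t n = sizeof DICTIONARY / sizeof DICTIONARY[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(DICTIONARY[i].element, element) == 0 &&
            strcmp(DICTIONARY[i].attribute, attribute) == 0) {
            return DICTIONARY[i].type_name;
        }
    }
    return NULL;
}
```

A query processor could call lookup_type("detail", "quantity") and get
a type name back without a schema ever having been written or applied.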

10. Coupling specs to PSVI as it exists today is architecturally unsound

The PSVI is a grab-bag of stuff that's defined as being the outcome of a 
particular operation; any attempt to pretend that all its contents can 
be talked about, addressed, or used in a uniform way is just misguided.
Also, it needs to be crystal-clear that you can have types without 
having a schema or doing validation.

Proposal: Let's issue a TAG finding that types ought to be addressable 
by name, and work with some WG to write architectural principles for 
naming them (qname or URI or both); getting this right is nontrivial, 
see item 6 above.  XQuery seems like a good example of a sensible way to 
use these names.

==========

Conclusion

Where I'd like to end up is:

- we have a list of well-known base types with well-known names that 
everybody uses consistently
- we have a generic naming system for types including complex types
- we have a well-defined API and serialization for those parts of the 
output of the schema validation process that are used in non-validation 
applications
- we have multiple different schema facilities aimed at different kinds 
of applications, which however exhibit consistency in (a) their use of 
base primitive types, (b) the way they name types, and (c) the way they 
expose their output to the world.

This seems achievable and not all that ambitious.

  -Tim

Received on Friday, 21 June 2002 20:31:50 UTC