- From: <noah_mendelsohn@us.ibm.com>
- Date: Tue, 24 Dec 2002 14:36:57 -0500
- To: Steven Taschuk <staschuk@telusplanet.net> (by way of "C. M. Sperberg-McQueen")
- Cc: www-xml-schema-comments@w3.org
Steven Taschuk writes:

> _Part 2: Datatypes_ defines canonical lexical
> representations for most of the built-in simple types,
> but their use is unclear. I'd like to see some
> amplification on this point in 1.1.

Canonical representations are used in the Structures Rec. in conjunction with setting default values, though we are considering eliminating that dependence eventually. In general, canonical forms are a convenience in two ways:

a) Some may wish to build implementations that start with a value and eventually serialize to characters. While such implementations are free to use any lexical form in principle, it was felt that a suggested preferred form would be helpful in promoting the development of tools such as shared libraries, in maximizing the proliferation of forms that users will be comfortable reading, etc.

b) Even when schema itself makes no use of canonical forms, other specifications may do so. Thus, we are providing a building block which other specifications can use to maximize interoperability.

> Trolling through the archives, I find a suggestion that
> canonicalization is useful in the context of signed
> XML, when intermediate parties in a transaction might
> replace one lexical representation with a different but
> equivalent one, and it is desired that this not
> invalidate the signature. This is a worthwhile goal,
> but it seems impossible to canonicalize a document
> without special knowledge of every type in the
> document.

Hard to comment without seeing the note in question. My understanding is that the existing c14n proposals for signatures (which are not tied at all to the notion of canonical lexical forms in schema datatypes) deal with well-formed XML, and do not consider alternate representations of the same type to be in the same equivalence class. I have seen some semi-formal proposals to do schema-aware c14ns. Those would have exactly the pros and cons you suggest.
Specifically, such a c14n would support signatures in cases where you truly do not care that a float:

    100

has been rewritten as

    1.0E+2

The fact is, there are some applications for which you do NOT want the signature to match on the above; you want to know that someone has tampered with your document. Today's c14ns do the right thing for those. One can imagine other systems in which the rewrite above is considered harmless, and in which you want the signature to match. You might imagine yet another where providing a value explicitly in the instance is equivalent to letting it default to the same value. In the end, you have to sign what you care about, and there's no obvious limit to the things that one user or another may want. I think the W3C can at best standardize c14n conventions for some of the most common use cases.

> For a silly example, consider the type
>   <simpleType name='onTheHour'>
>     <restriction base='dateTime'>
>       <pattern value='.*T..:00.*'/>
>     </restriction>
>   </simpleType>
>
> which requires the minute field of its values to be
> zero. Canonicalizing values of this type in general is
> impossible without special knowledge of the type: an
> algorithm for canonicalizing dateTimes in general
> cannot be used since conversion of an onTheHour value
> to UTC might change the minutes field and make the
> result invalid for onTheHour.

Now you're raising a different point, I think. It is possible to create restrictions that eliminate the canonical forms for some values. Some of us in the WG have been nervous since day 1 about pattern restrictions that operate on the lexical space, but users do seem to want them. The WG has given some attention to the possible need to clarify the rec in this area, but I don't think we are headed toward eliminating the possibility of such a restriction. Of course, if you were to build a c14n/signature system that depended on such restrictions having canonical form, it would not work. That's a second order effect.
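
To make the quoted onTheHour concern concrete, here is a minimal Python sketch. It assumes that the canonical form of a timezoned dateTime is its UTC ("Z") representation, and it uses Python's `re` module as a stand-in for the schema pattern language (the two agree for this simple pattern). The helper name `to_utc_lexical` and the sample value are hypothetical, chosen only for illustration.

```python
import re
from datetime import datetime, timezone

# Pattern facet from the quoted onTheHour type. Schema patterns match the
# whole lexical form, so fullmatch() is used below rather than search().
ON_THE_HOUR = re.compile(r".*T..:00.*")


def to_utc_lexical(lexical: str) -> str:
    """Hypothetical helper: rewrite a timezoned dateTime in UTC with a 'Z' suffix."""
    value = datetime.fromisoformat(lexical)
    return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# A value whose minute field is zero, written with a half-hour offset.
original = "2002-12-24T14:00:00+05:30"
normalized = to_utc_lexical(original)

print(original, bool(ON_THE_HOUR.fullmatch(original)))
print(normalized, bool(ON_THE_HOUR.fullmatch(normalized)))
```

The original lexical form satisfies the pattern, but the UTC form of the same value has a minute field of 30 and does not, so this value has no lexical form that is both canonical and valid for the restricted type.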
No application that depended on the canonical forms for such a type would work.

> So, if canonical lexical representations cannot be used
> by a generic processor to canonicalize a document, then
> what are they for? Only the processors with special
> knowledge?

They can be used in the many applications where code starts with a value and wishes to create a reasonably useful lexical form (as opposed to 00000000000001 for the integer 1). As you observe, for better or worse, some simple types do not have canonical forms today. Applications using such types can't get the benefit. It's in general a theorem-proving exercise to see whether a value assigned to a datatype with pattern facet restrictions has ANY legal lexical form, much less a canonical one.

> While I'm at it, why isn't canonical form a facet of
> the type?

IMO, because you can't alter or depend on the canonical form when creating restrictions. I think it would be plausible to have a "useOnlyCanonical" facet that a restriction could set to "true", which would have the effect of a pattern that matched all-and-only the canonical forms for a type. On the other hand, that's creeping featurism in an already complicated spec.

> Incidentally, the above example, silly as it is,
> illustrates an important respect in which values of a
> type derived by restriction cannot be treated by a
> generic processor as values of the base type. It is a
> bit surprising that there are any such respects at all
> (if, like me, you are coming from an object-oriented
> view of "type"); I think this point deserves some
> commentary in 1.1.

I don't understand. Your "onTheHour" times aren't legal as both lexical and value space forms for xsd:dateTime?

Thanks.

Noah

------------------------------------------------------------------
Noah Mendelsohn                              Voice: 1-617-693-4036
IBM Corporation                                Fax: 1-617-693-8676
One Rogers Street
Cambridge, MA 02142
------------------------------------------------------------------
Received on Tuesday, 24 December 2002 14:41:06 UTC