- From: <noah_mendelsohn@us.ibm.com>
- Date: Tue, 24 Dec 2002 14:36:57 -0500
- To: Steven Taschuk <staschuk@telusplanet.net> (by way of "C. M. Sperberg-McQueen")
- Cc: www-xml-schema-comments@w3.org
Steven Taschuk writes:
>
> _Part 2: Datatypes_ defines canonical lexical
> representations for most of the built-in simple types,
> but their use is unclear. I'd like to see some
> amplification on this point in 1.1.
Canonical representations are used in the structures rec. in conjunction
with setting default values, though we are considering eliminating that
dependence eventually. In general, canonical forms are a convenience to
those who:
a) May wish to build implementations that start with a value and
eventually serialize to characters. While such implementations are free
to use any lexical form in principle, it was felt that a suggested
preferred form would be helpful in promoting the development of tools such
as shared libraries, in maximizing the proliferation of forms that users
will be comfortable reading, etc.
b) Even when schema itself makes no use of canonical forms, other
specifications may do so. Thus, we are providing a building block which
other specifications can use to maximize interoperability.
> Trolling through the archives, I find a suggestion that
> canonicalization is useful in the context of signed
> XML, when intermediate parties in a transaction might
> replace one lexical representation with a different but
> equivalent one, and it is desired that this not
> invalidate the signature. This is a worthwhile goal,
> but it seems impossible to canonicalize a document
> without special knowledge of every type in the
> document.
Hard to comment without seeing the note in question. My understanding is
that the existing c14n proposals for signatures (which are not tied at all
to the notion of canonical lexical forms in schema datatypes) deal with
well-formed XML, and do not consider alternate representations of the same
type to be in the same equivalence class. I have seen some semi-formal
proposals to do schema-aware c14ns. Those would have exactly the pros and
cons you suggest. Specifically, such a c14n would support signatures in
cases where you truly do not care that a float:
100
has been rewritten as
1.0E+2
The fact is, there are some applications for which you do NOT want the
signature to match on the above; you want to know that someone has
tampered with your document. Today's c14ns do the right thing for those.
One can imagine other systems in which the rewrite above is considered
harmless, and in which you want the signature to match. You might imagine
yet another where providing a value explicitly in the instance is
equivalent to letting it default to the same value. In the end, you have
to sign what you care about, and there's no obvious limit to the things
that one user or another may want. I think the W3C can at best
standardize c14n conventions for some of the most common use cases.
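
To make the float case concrete, here is a minimal Python sketch of the trade-off. It is an illustration under stated assumptions, not a conforming implementation: `canonical_float` and `digest` are hypothetical helper names, the canonicalizer handles only finite values and ignores single-precision rounding, and a SHA-256 digest stands in for a real signature.

```python
import hashlib
from decimal import Decimal


def canonical_float(lexical: str) -> str:
    """Approximate canonical form for a finite float lexical form:
    one digit before the point, at least one after, then 'E' and the
    exponent with no plus sign or leading zeros."""
    d = Decimal(lexical.strip())
    if not d.is_finite():
        raise ValueError("this sketch handles finite values only")
    if d.is_zero():
        return "0.0E0"
    sign, digits, exp = d.as_tuple()
    digits = list(digits)
    # Drop trailing zeros from the digit string, adjusting the exponent.
    while len(digits) > 1 and digits[-1] == 0:
        digits.pop()
        exp += 1
    sci_exp = exp + len(digits) - 1
    fraction = "".join(str(n) for n in digits[1:]) or "0"
    return ("-" if sign else "") + f"{digits[0]}.{fraction}E{sci_exp}"


def digest(text: str) -> str:
    """Stand-in for a signature: a SHA-256 digest of the characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


a, b = "100", "1.0E+2"
# Character-level digests differ, so the rewrite is detected as tampering.
print(digest(a) == digest(b))
# Digests over the canonical forms agree, so the rewrite is tolerated.
print(digest(canonical_float(a)) == digest(canonical_float(b)))
```

Which of the two comparisons is the right one depends on the application, which is the point made above.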
> For a silly example, consider the type
> <simpleType name='onTheHour'>
> <restriction base='dateTime'>
> <pattern value='.*T..:00.*'/>
> </restriction>
> </simpleType>
>
> which requires the minute field of its values to be
> zero. Canonicalizing values of this type in general is
> impossible without special knowledge of the type: an
> algorithm for canonicalizing dateTimes in general
> cannot be used since conversion of an onTheHour value
> to UTC might change the minutes field and make the
> result invalid for onTheHour.
Now you're raising a different point, I think. It is possible to create
restrictions that eliminate the canonical forms for some values. Some of
us in the WG have been nervous since day 1 about pattern restrictions that
operate on the lexical space, but users do seem to want them. The WG has
given some attention to the possible need to clarify the rec in this area,
but I don't think we are headed toward eliminating the possibility of such
a restriction. Of course, if you were to build a c14n/signature system
that depended on such restrictions having canonical form, it would not
work. That's a second order effect. No application that depended on the
canonical forms for such a type would work.
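
The onTheHour problem can be reproduced in a few lines of Python. This is a sketch under assumptions: `to_utc_lexical` and `is_on_the_hour` are hypothetical helpers, normalization to UTC stands in for dateTime canonicalization, and Python's `re` is used in place of the schema regular-expression language (the two agree for this simple pattern when the match is anchored at both ends).

```python
import re
from datetime import datetime, timezone

# The pattern facet from the onTheHour example; schema patterns must
# match the whole lexical form, hence fullmatch below.
ON_THE_HOUR = re.compile(r".*T..:00.*")


def to_utc_lexical(lexical: str) -> str:
    """Rewrite a timezoned dateTime lexical form in UTC with a 'Z' suffix."""
    dt = datetime.fromisoformat(lexical)
    if dt.tzinfo is None:
        raise ValueError("this sketch expects an explicit timezone offset")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def is_on_the_hour(lexical: str) -> bool:
    """Check the lexical form against the onTheHour pattern."""
    return ON_THE_HOUR.fullmatch(lexical) is not None


original = "2002-12-24T14:00:00+05:30"  # a half-hour offset from UTC
normalized = to_utc_lexical(original)

print(original, is_on_the_hour(original))
print(normalized, is_on_the_hour(normalized))
```

With a half-hour offset the UTC form has a minutes field of 30, so the same value has a lexical form that satisfies the pattern and a normalized form that does not.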
> So, if canonical lexical representations cannot be used
> by a generic processor to canonicalize a document, then
> what are they for? Only the processors with special
> knowledge?
They can be used in the many applications where code starts with a value
and wishes to create a reasonably useful lexical form (as opposed to
00000000000001 for the integer 1). As you observe, for better or worse,
some simple types do not have canonical forms today. Applications using
such types can't get the benefit. It's in general a theorem-proving
exercise to see whether a value assigned to a datatype with pattern facet
restrictions has ANY legal lexical form, much less a canonical one.
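
For the unrestricted integer case, going from a value (or from an arbitrary legal lexical form) to the canonical form is straightforward. A minimal Python sketch, where `canonical_integer` is a hypothetical helper name and no pattern facets are assumed to be in play:

```python
import re

# Legal lexical forms for an integer: optional sign, then decimal digits.
_INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")


def canonical_integer(lexical: str) -> str:
    """Map any legal integer lexical form to the form with no leading
    zeros and no plus sign."""
    if _INTEGER_LEXICAL.fullmatch(lexical) is None:
        raise ValueError(f"not an integer lexical form: {lexical!r}")
    return str(int(lexical))


print(canonical_integer("00000000000001"))
print(canonical_integer("+42"))
```

Once a pattern facet restricts the lexical space, nothing guarantees that the string produced this way is still legal for the derived type.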
> While I'm at it, why isn't canonical form a facet of
> the type?
IMO, because you can't alter or depend on the canonical form when creating
restrictions. I think it would be plausible to have a "useOnlyCanonical"
facet that a restriction could set to "true", which would have the effect
of a pattern that matched all-and-only the canonical forms for a type. On
the other hand, that's creeping featurism in an already complicated spec.
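
To illustrate what such a facet would amount to, here is a Python sketch for the integer type only. "useOnlyCanonical" is the hypothetical facet floated above and is not part of XML Schema; `accepts_only_canonical` is likewise a hypothetical helper, and the regular expression is one assumed way to describe all-and-only the canonical integer forms.

```python
import re

# Canonical integer forms: "0", or an optional minus sign followed by
# digits with no leading zero. No plus sign and no "-0".
CANONICAL_INTEGER = re.compile(r"0|-?[1-9][0-9]*")


def accepts_only_canonical(lexical: str) -> bool:
    """Emulate a pattern that admits only canonical integer forms."""
    return CANONICAL_INTEGER.fullmatch(lexical) is not None


for candidate in ["1", "01", "+1", "-0", "0", "-17"]:
    print(candidate, accepts_only_canonical(candidate))
```

Each built-in type would need its own such pattern, which gives a sense of the added complexity.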
> Incidentally, the above example, silly as it is,
> illustrates an important respect in which values of a
> type derived by restriction cannot be treated by a
> generic processor as values of the base type. It is a
> bit surprising that there are any such respects at all
> (if, like me, you are coming from an object-oriented
> view of "type"); I think this point deserves some
> commentary in 1.1.
I don't understand. Your "onTheHour" times aren't legal as both
lexical and value space forms for xsd:dateTime?
Thanks.
Noah
------------------------------------------------------------------
Noah Mendelsohn Voice: 1-617-693-4036
IBM Corporation Fax: 1-617-693-8676
One Rogers Street
Cambridge, MA 02142
------------------------------------------------------------------
Received on Tuesday, 24 December 2002 14:41:06 UTC