Re: What are canonical lexical representations for?

Steven Taschuk writes:
> 
> _Part 2: Datatypes_ defines canonical lexical
> representations for most of the built-in simple types,
> but their use is unclear.  I'd like to see some
> amplification on this point in 1.1.

Canonical representations are used in the structures rec. in conjunction 
with setting default values, though we are considering eliminating that 
dependence eventually.  In general, canonical forms are a convenience to 
those who:

a) May wish to build implementations that start with a value and 
eventually serialize to characters.  While such implementations are free 
to use any lexical form in principle, it was felt that a suggested 
preferred form would be helpful in promoting the development of tools such 
as shared libraries, in maximizing the proliferation of forms that users 
will be comfortable reading, etc.

b) Even when Schema itself makes no use of canonical forms, other 
specifications may do so.  Thus, we are providing a building block which 
other specifications can use to maximize interoperability.
 
> Trolling through the archives, I find a suggestion that
> canonicalization is useful in the context of signed
> XML, when intermediate parties in a transaction might
> replace one lexical representation with a different but
> equivalent one, and it is desired that this not
> invalidate the signature.  This is a worthwhile goal,
> but it seems impossible to canonicalize a document
> without special knowledge of every type in the
> document.

Hard to comment without seeing the note in question.  My understanding is 
that the existing c14n proposals for signatures (which are not tied at all 
to the notion of canonical lexical forms in schema datatypes)  deal with 
well-formed XML, and do not consider alternate representations of the same 
type to be in the same equivalence class.    I have seen some semi-formal 
proposals to do schema-aware c14ns.  Those would have exactly the pros and 
cons you suggest.  Specifically, such a c14n would support signatures in 
cases where you truly do not care that a float:

    100

has been rewritten as

    1.0E+2

The fact is, there are some applications for which you do NOT want the 
signature to match on the above;  you want to know that someone has 
tampered with your document.  Today's c14ns do the right thing for those. 
One can imagine other systems in which the rewrite above is considered 
harmless, and in which you want the signature to match.  You might imagine 
yet another where providing a value explicitly in the instance is 
equivalent to letting it default to the same value.  In the end, you have 
to sign what you care about, and there's no obvious limit to the things 
that one user or another may want.  I think the W3C can at best 
standardize c14n conventions for some of the most common use cases.
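
To make the distinction concrete, here is a minimal Python sketch.  It is 
not any existing c14n proposal: SHA-256 stands in for a real signature, 
and Python's float parsing plus repr() stands in for a schema-aware 
canonical form (it is not the canonical form defined in Part 2).  The 
function names are invented for illustration.

```python
import hashlib


def digest(text: str) -> str:
    """Hash the characters exactly as written (byte-level view)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def value_aware_digest(text: str) -> str:
    """Hash a normalized form of the float value.

    Assumption: repr(float(...)) is only a stand-in for a type-aware
    canonical form; a real system would use the schema's definition.
    """
    return digest(repr(float(text)))


original = "100"
rewritten = "1.0E+2"

# Byte-level comparison: the rewrite is detected as a change.
byte_level_match = digest(original) == digest(rewritten)

# Value-level comparison: the rewrite is treated as harmless.
value_level_match = value_aware_digest(original) == value_aware_digest(rewritten)

print(byte_level_match, value_level_match)
```

Which of the two comparisons is "right" depends on what the signer cares 
about, which is the point made above.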

 
> For a silly example, consider the type
>    <simpleType name='onTheHour'>
>      <restriction base='dateTime'>
>        <pattern value='.*T..:00.*'/>
>      </restriction>
>    </simpleType>
> 
> which requires the minute field of its values to be
> zero.  Canonicalizing values of this type in general is
> impossible without special knowledge of the type: an
> algorithm for canonicalizing dateTimes in general
> cannot be used since conversion of an onTheHour value
> to UTC might change the minutes field and make the
> result invalid for onTheHour.

Now you're raising a different point, I think.  It is possible to create 
restrictions that eliminate the canonical forms for some values.  Some of 
us in the WG have been nervous since day 1 about pattern restrictions that 
operate on the lexical space, but users do seem to want them.  The WG has 
given some attention to the possible need to clarify the rec in this area, 
but I don't think we are headed toward eliminating the possibility of such 
a restriction.  Of course, if you were to build a c14n/signature system 
that depended on such restrictions having canonical form, it would not 
work.  That's a second order effect.  No application that depended on the 
canonical forms for such a type would work.
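
The effect can be reproduced with a small Python sketch.  Assumptions: 
Python's re module approximates the schema pattern (schema patterns are 
implicitly anchored, hence fullmatch), the sample timestamp and its 
+05:30 offset are made up, and to_utc_form is a hypothetical helper, not 
a full dateTime canonicalizer.

```python
import re
from datetime import datetime, timezone

# The pattern facet from the onTheHour example, applied to the lexical form.
ON_THE_HOUR = re.compile(r".*T..:00.*")


def is_on_the_hour(lexical: str) -> bool:
    """True if the lexical form satisfies the onTheHour pattern."""
    return ON_THE_HOUR.fullmatch(lexical) is not None


def to_utc_form(lexical: str) -> str:
    """Rewrite a timezoned dateTime as the equivalent UTC ('Z') form."""
    moment = datetime.fromisoformat(lexical)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


local = "2002-12-24T10:00:00+05:30"   # hypothetical sample value
utc = to_utc_form(local)              # same instant, minutes field is now 30

print(local, is_on_the_hour(local))
print(utc, is_on_the_hour(utc))
```

The UTC form denotes the same value, but it no longer matches the 
pattern, so for this value the restricted type has no canonical form.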
 
> So, if canonical lexical representations cannot be used
> by a generic processor to canonicalize a document, then
> what are they for?  Only the processors with special
> knowledge?

They can be used in the many applications where code starts with a value 
and wishes to create a reasonably useful lexical form (as opposed to 
00000000000001 for the integer 1).  As you observe, for better or worse, 
some simple types do not have canonical forms today.   Applications using 
such types can't get the benefit.  It's in general a theorem-proving 
exercise to see whether a value assigned to a datatype with pattern facet 
restrictions has ANY legal lexical form, much less a canonical one. 
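
The value-to-lexical case for integer can be sketched in a few lines of 
Python.  This is an illustration only: the function names are invented, 
and it covers the plain integer type with no pattern facets applied.

```python
import re

# Lexical space of integer: optional sign, then one or more decimal digits.
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")


def parse_integer(lexical: str) -> int:
    """Map a lexical form to its value, rejecting anything else."""
    if INTEGER_LEXICAL.fullmatch(lexical) is None:
        raise ValueError(f"not an integer lexical form: {lexical!r}")
    return int(lexical)


def canonical_integer(value: int) -> str:
    """Serialize with no leading zeros and no '+' sign."""
    return str(value)


for lexical in ["00000000000001", "+42", "-007", "-0"]:
    print(lexical, "->", canonical_integer(parse_integer(lexical)))
```

Many lexical forms map to one value, and the serializer picks a single 
readable one among them.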

> While I'm at it, why isn't canonical form a facet of
> the type?

IMO, because you can't alter or depend on the canonical form when creating 
restrictions.  I think it would be plausible to have a "useOnlyCanonical" 
facet that a restriction could set to "true", which would have the effect 
of a pattern that matched all-and-only the canonical forms for a type.  On 
the other hand, that's creeping featurism in an already complicated spec.
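
As a sketch of what such a facet might mean, take boolean, whose lexical 
forms are true, false, 1 and 0, with true and false as the canonical 
ones.  "useOnlyCanonical" is the hypothetical facet floated above, not 
part of any spec, and the Python names below are invented.

```python
# Lexical-to-value mapping for boolean.
BOOLEAN_LEXICAL = {"true": True, "false": False, "1": True, "0": False}


def canonical_boolean(value: bool) -> str:
    """Canonical lexical form of a boolean value."""
    return "true" if value else "false"


def accepts(lexical: str, use_only_canonical: bool = False) -> bool:
    """Validate a lexical form, optionally allowing only canonical forms.

    With use_only_canonical set, a form is accepted only if it is exactly
    what the canonical mapping would produce for its own value.
    """
    if lexical not in BOOLEAN_LEXICAL:
        return False
    if not use_only_canonical:
        return True
    return canonical_boolean(BOOLEAN_LEXICAL[lexical]) == lexical


print(accepts("1"), accepts("1", use_only_canonical=True))
print(accepts("true", use_only_canonical=True))
```

The check "canonical(parse(x)) == x" behaves like a pattern that matches 
all and only the canonical forms, without writing that pattern by hand.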
 
> Incidentally, the above example, silly as it is,
> illustrates an important respect in which values of a
> type derived by restriction cannot be treated by a
> generic processor as values of the base type.  It is a
> bit surprising that there are any such respects at all
> (if, like me, you are coming from an object-oriented
> view of "type"); I think this point deserves some
> commentary in 1.1.

I don't understand.  Your "onTheHour" times aren't legal as both 
lexical and value space forms for xsd:dateTime?

Thanks. 

Noah

------------------------------------------------------------------
Noah Mendelsohn                              Voice: 1-617-693-4036
IBM Corporation                                Fax: 1-617-693-8676
One Rogers Street
Cambridge, MA 02142
------------------------------------------------------------------

Received on Tuesday, 24 December 2002 14:41:06 UTC