Re: What are canonical lexical representations for? (2003-01-08, www-xml-schema-comments@w3.org, January to March 2003)

From: Steven Taschuk <staschuk@telusplanet.net>
Date: Wed, 08 Jan 2003 06:26:47 -0700
To: W3C XML Schema Comments list <www-xml-schema-comments@w3.org>
Message-Id: <5.1.0.14.1.20030108062641.029c15f0@localhost>
Quoth noah_mendelsohn@us.ibm.com:
 > Steven Taschuk writes:
 > > _Part 2: Datatypes_ defines canonical lexical
 > > representations for most of the built-in simple types,
 > > but their use is unclear.  [...]
 >
 > a) May wish to build implementations that start with a value and
 > eventually serialize to characters.  [...]

Ah, yes.  Good point.

 > > Trolling through the archives, I find a suggestion that
 > > canonicalization is useful in the context of signed
 > > XML [...]
 >
 > Hard to comment without seeing the note in question.  [...]

Fair enough.  I refer to "XML Schema and the necessity for
canonical representations", <dee3@us.ibm.com>, 1999-05-21:
<http://lists.w3.org/Archives/Public/www-xml-schema-comments/1999AprJun/0060.html>

I gather that that note was written fairly early in the process,
to argue for the need for canonical representations in the first
place.  Digital signatures are just one example of an application
for which canonicalization issues are important; others certainly
exist, and I have no particular stake in signatures specifically.

 > [...] Specifically, such a c14n would support signatures in
 > cases where you truly do not care that a float:
 >
 >     100
 >
 > has been rewritten as
 >
 >     1.0E+2
 >
 > The fact is, there are some applications for which you do NOT want the
 > signature to match on the above;  you want to know that someone has
 > tampered with your document.  [...] I think the W3C can at best
 > standardize c14n conventions for some of the most common use cases.

Absolutely.  Let me clarify the angle I'm approaching this from.

Whatever equivalence relation on documents I wish to use in a
particular application, it is useful to have a canonicalizer for
that relation, that is, a processor which takes as input an
arbitrary document and produces as output an equivalent canonical
form, under the equivalence relation of interest.

(This is not the only way to implement an equivalence relation,
but it has the merit of loose coupling: it permits, for example,
digital signature software, version management systems, and file
comparison tools to operate on byte or character streams without
any knowledge of the equivalence relation I deem to be most
appropriate for the case at hand.)
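
A minimal sketch of that loose coupling, in Python. It assumes a toy
equivalence relation under which two decimal spellings of the same
number are equivalent. The names canonical_float and digest are
hypothetical, and the spelling chosen here (one leading digit, "E",
no "+" in the exponent) is for illustration only; it ignores INF, NaN,
and 32-bit rounding, so it is not a full xsd:float canonicalizer.

```python
import hashlib
from decimal import Decimal


def canonical_float(lexical: str) -> str:
    """Map a finite decimal spelling to one fixed spelling (illustrative only)."""
    # normalize() strips trailing zeros, so "100" and "1.0E+2" yield the same tuple.
    sign, digits, exponent = Decimal(lexical.strip()).normalize().as_tuple()
    first = str(digits[0])
    rest = "".join(str(d) for d in digits[1:]) or "0"
    # Shift the exponent so that exactly one digit precedes the decimal point.
    shifted = exponent + len(digits) - 1
    return f"{'-' if sign else ''}{first}.{rest}E{shifted}"


def digest(text: str) -> str:
    """Stand-in for a signing or comparison tool: it sees only bytes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# The byte-level tool distinguishes the raw spellings...
print(digest("100") == digest("1.0E+2"))
# ...but agrees once both have passed through the canonicalizer.
print(digest(canonical_float("100")) == digest(canonical_float("1.0E+2")))
```

Note that digest knows nothing about floats; swapping in a different
equivalence relation means swapping the canonicalizer only.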

XML Schema implies a model of what XML documents consist of; I
feel it is desirable to be able to write such a canonicalizer for
the equivalence relation under which documents are equivalent if
they differ only in ways not reflected in that model.  Among other
things, this includes the use of alternative lexical
representations for the same value.

So far this is all obvious.

Now, how should such a canonicalizer canonicalize representations
of user-defined simple types?  A naïve implementation would apply
algorithms appropriate for the built-in types from which they are
derived -- if this approach were sound, it would have the merit of
being applicable to any simple type whatsoever (provided schema
information were available).  My onTheHour example, however, shows
that this approach can generate "canonical" documents that are not
schema-valid.
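
To make the failure concrete, here is a minimal Python sketch. The
onTheHour definition is not restated in this message, so the code
assumes, purely for illustration, a restriction of xsd:dateTime by a
pattern facet requiring zero minutes and seconds. ON_THE_HOUR,
is_on_the_hour, and canonical_datetime are hypothetical names, and the
canonicalizer models only one aspect of the base type's canonical form:
conversion of timezoned values to UTC, written with "Z".

```python
import re
from datetime import datetime, timezone

# Assumed pattern facet for the derived type: minutes and seconds must be 00.
ON_THE_HOUR = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:00:00(Z|[+-]\d{2}:\d{2})?")


def is_on_the_hour(lexical: str) -> bool:
    """Check a lexical form against the assumed pattern facet."""
    return ON_THE_HOUR.fullmatch(lexical) is not None


def canonical_datetime(lexical: str) -> str:
    """Naive canonicalizer: treat the value as a plain base-type dateTime."""
    dt = datetime.fromisoformat(lexical)
    if dt.tzinfo is None:
        # Untimezoned values are left alone in this sketch.
        return dt.strftime("%Y-%m-%dT%H:%M:%S")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


original = "2003-01-08T12:00:00+05:30"
canonical = canonical_datetime(original)

print(original, is_on_the_hour(original))
print(canonical, is_on_the_hour(canonical))
```

With a half-hour offset such as +05:30, the input satisfies the
assumed pattern while the UTC form does not, so the "canonical" output
is no longer valid against the derived type.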

(Schema-invalid canonical documents might be tolerable if
schema-valid versions could be reconstructed at need; but as you
pointed out on a related point, this is in general a
theorem-proving exercise.)

This is the problem I refer to when I say that a canonicalizer
needs special knowledge of all the simple types it encounters,
namely knowledge of how to canonicalize representations of those
types.  This requirement seems to make canonical lexical
representations much less useful than they would otherwise be,
indeed, to make impossible what is to me the most obvious and
desirable use of them... which is what prompted my question.

   [...]
 > > While I'm at it, why isn't canonical form a facet of
 > > the type?
 >
 > IMO, because you can't alter or depend on the canonical form when creating
 > restrictions.  [...]

A good point.  But you can't alter the equal facet either (unless
I'm missing something in the recommendation).

 > > Incidentally, the above example, silly as it is,
 > > illustrates an important respect in which values of a
 > > type derived by restriction cannot be treated by a
 > > generic processor as values of the base type.  [...]
 >
 > I don't understand.  Your "onTheHour" times aren't legal as both
 > lexical and value space forms for xsd:dateTime?

They are, and surely that is sufficient for most processors; I
meant to refer to canonicalizers specifically, which cannot
canonicalize representations of a type as if they were
representations of the base type without sometimes producing
schema-invalid output.

-- 
Steven Taschuk           | Receive them ignorant;
staschuk@telusplanet.net | dispatch them confused.
                          |   (Weschler's Teaching Motto)
Received on Wednesday, 8 January 2003 08:31:15 UTC