Re: What are canonical lexical representations for? from noah_mendelsohn@us.ibm.com on 2002-12-27 (www-xml-schema-comments@w3.org from October to December 2002)

From: <noah_mendelsohn@us.ibm.com>
Date: Fri, 27 Dec 2002 14:17:27 -0500
To: Steven Taschuk <staschuk@telusplanet.net>
Cc: www-xml-schema-comments@w3.org
Message-ID: <OFDBBE920E.2C2DBCEA-ON85256C9C.006A2B48@lotus.com>

Steven Taschuk writes:

> > > Trolling through the archives, I find a suggestion that
> > > canonicalization is useful in the context of signed
> > > XML [...]
> > 
> > Hard to comment without seeing the note in question.  [...]
> 
> Fair enough.  I refer to "XML Schema and the necessity for
> canonical representations", <dee3@us.ibm.com>, 1999-05-21:
> 
> <http://lists.w3.org/Archives/Public/www-xml-schema-comments/1999AprJun/0060.html>
> 
> I gather that that note was written fairly early in the
> process, to argue for the need for canonical
> representations in the first place.  Digital signatures
> are just one example of an application for which
> canonicalization issues are important; others certainly
> exist, and I have no particular stake in signatures
> specifically.

Don was not a member of the workgroup at the time he wrote
(or ever, as far as I know).  I believe he was providing
input reflective of one set of concerns, namely that DSIG 
is among the situations in which certain types of c14n 
transformations might be useful. 

I think it's fair to say that these were NOT the
reasons that actually led to the inclusion of canonical
forms in the schema rec.  My opinion is that the
reasons were closer to the ones expressed in my earlier
note.  I think we felt that it would be for the
security community to gather requirements as to
what should and should not be signed for various
applications of DSIG, and we certainly did not
go through such a requirements process.

> XML Schema implies a model of what XML documents
> consist of; 

Really?  I think you would find a lot of disagreement,
at least from some members of the schema team.  Schema
provides a definition of the assessment relation, which
allows you to determine if an element is valid per an
element declaration or complex type.  Schema's model of
the document is the input infoset, which is character
based.  Schema makes available certain additional
information as a byproduct of assessment, including
indication of whether attributes were defaulted and if
so with what value, etc.  This additional information
is placed into the so-called PSVI.

Schema does not directly include "values" from the 
datatypes value space in the PSVI.  (Though I agree
that such values are in all cases determinable, and
I think it would have been coherent to include them
in the PSVI.)

I would say that Schema is careful NOT to give you
a new model of the document, though it surely creates
building blocks from which such models could be
derived.  Would such a model include defaulted 
attributes?  Maybe.  Equivalence of 100 and 1E+2?
Maybe.  Other groups such as Query have looked
into building data models based on the information
available from schema assessment, but the schema
WG certainly did not define such a model, in 
my opinion anyway.
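
As a concrete illustration of the 100 versus 1E+2 question, here is a
minimal Python sketch that maps lexical forms of xs:double onto one
canonical string.  The function name canonical_double is hypothetical,
Python's float() parser only approximates the xs:double lexical space,
and infinities and NaN are deliberately left out.  It shows one way a
data model could treat the two spellings as equal; it is not something
the schema rec. mandates for the PSVI.

```python
import math
from decimal import Decimal


def canonical_double(lexical: str) -> str:
    """Return a canonical-style string for a finite xs:double lexical form.

    Sketch only: mantissa with one non-zero digit before the point and at
    least one digit after it, followed by "E" and an unpadded exponent.
    """
    value = float(lexical)
    if not math.isfinite(value):
        raise ValueError("INF and NaN are not handled in this sketch")
    if value == 0:
        return "0.0E0"
    # repr() gives the shortest decimal string that round-trips the float.
    sign, digits, exponent = Decimal(repr(value)).as_tuple()
    text = "".join(str(d) for d in digits)
    scientific_exponent = len(text) - 1 + exponent
    fraction = text[1:].rstrip("0") or "0"
    prefix = "-" if sign else ""
    return f"{prefix}{text[0]}.{fraction}E{scientific_exponent}"


print(canonical_double("100"))
print(canonical_double("1E+2"))
```

Both calls produce the same string, so two documents that differ only
in this spelling would compare equal after such a rewrite.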

> I feel it is desirable to be able to write such a
> canonicalizer for the equivalence relation under which
> documents are equivalent if they differ only in ways
> not reflected in that model.  Among other things, this
> includes the use of alternative lexical representations
> for the same value.

Well, I think it is possible to define and implement
many such canonicalizers.  The question is:  which ones
should be standardized, and by whom?

> Now, how should such a canonicalizer canonicalize
> representations of user-defined simple types?  A naïve
> implementation would apply algorithms appropriate for
> the built-in types from which they are derived -- if
> this approach were sound, it would have the merit of
> being applicable to any simple type whatsoever
> (provided schema information were available).  My
> onTheHour example, however, shows that this approach
> can generate "canonical" documents that are not
> schema-valid.

I think that I have acknowledged that the WG is
well aware of the concern that certain user-defined
types have no canonical form, and that there
are applications of XML schema (such as DSig),
for which this state of affairs is a compromise.
Someone else in the WG will have to remind me of
exactly where we stand in considering this, but
I believe it was or is being reviewed as a possible
concern for either 1.1 or 2.0 (should we decide
to build such versions).
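
To make the concern concrete, here is a minimal Python sketch of how a
canonicalizer that applies the base type's rules can break a derived
type.  It does NOT reproduce the onTheHour example, whose definition is
not quoted in this message.  Instead it assumes a hypothetical type
restricting xs:decimal with a pattern facet that requires exactly two
fraction digits; the names TWO_PLACES, canonical_decimal and
is_valid_two_places are all invented for illustration.

```python
import re
from decimal import Decimal

# Hypothetical pattern facet on a type derived from xs:decimal:
# one or more digits, a point, then exactly two fraction digits.
TWO_PLACES = re.compile(r"\d+\.\d{2}")


def canonical_decimal(lexical: str) -> str:
    """Canonical-style form for the base type: no redundant trailing
    zeros, but always a point with at least one digit after it."""
    text = format(Decimal(lexical), "f")
    whole, _, fraction = text.partition(".")
    fraction = fraction.rstrip("0") or "0"
    return f"{whole}.{fraction}"


def is_valid_two_places(lexical: str) -> bool:
    """Check the hypothetical pattern facet (patterns match the whole
    lexical form, hence fullmatch)."""
    return TWO_PLACES.fullmatch(lexical) is not None


original = "5.00"
rewritten = canonical_decimal(original)
print(original, is_valid_two_places(original))
print(rewritten, is_valid_two_places(rewritten))
```

The original lexical form satisfies the pattern, but the form produced
by the base type's canonicalization does not, which is the kind of
schema-invalid "canonical" output described above.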

------------------------------------------------------------------
Noah Mendelsohn                              Voice: 1-617-693-4036
IBM Corporation                                Fax: 1-617-693-8676
One Rogers Street
Cambridge, MA 02142
------------------------------------------------------------------
Received on Friday, 27 December 2002 14:21:21 UTC