Re: New XMLP Issue Relating to Canonical Forms from noah_mendelsohn@us.ibm.com on 2003-10-10 (xml-dist-app@w3.org from October 2003)

From: <noah_mendelsohn@us.ibm.com>
Date: Fri, 10 Oct 2003 10:36:01 -0400
To: rsalz@datapower.com
Cc: Elliotte Rusty Harold <elharo@metalab.unc.edu>, "xml-dist-app@w3.org" <xml-dist-app@w3.org>
Message-ID: <OF1972DE37.C29E1F79-ON85256DBB.004F189C@lotus.com>

Sorry, but I think there's been some confusion here.  The current 
discussion bears no immediate relation to XML c14n, DSig, etc.  It's 
actually more fundamental to any use of SOAP with MTOM, independent of 
whether XML DSig or the associated c14n Recs are to be used.  In brief: 
we've been referring to canonical forms of schema datatypes, as defined in 
the datatypes recommendation, as opposed to the term canonical as 
introduced by the c14n recs that are used in conjunction with DSig.   The 
following explains in more detail.

The trick in MTOM is basically to say that for data known to be in a 
lexical form corresponding to xsd:base64Binary, sending the value (in the 
sense of XML schema value space) is sufficient to reconstruct the lexical 
form.  This would be like saying for integers that you can reconstruct the 
three-character sequence '1' '2' '3' by sending the value that in Java 
would be int i = 123.  The point is that, in the case of integers, that's 
true only if you know that the integer has no leading zero (or that it 
invariably has one leading zero, or whatever.)  In short, if the lexical 
and value forms are exactly 1-to-1, then this trick works. 
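
To make the integer analogy concrete, here is a minimal Python sketch (Python rather than Java only for brevity). The function name roundtrips is hypothetical, and the sketch assumes the receiver always regenerates the shortest decimal form with no leading zeros.

```python
def roundtrips(lexical: str) -> bool:
    """Return True if sending only the integer value lets the
    receiver rebuild exactly the original character sequence."""
    value = int(lexical)    # what would go on the wire (value space)
    rebuilt = str(value)    # what the receiver regenerates (assumed rule: no leading zeros)
    return rebuilt == lexical


print(roundtrips("123"))    # the lexical form survives the trip
print(roundtrips("0123"))   # the leading zero is lost
```

The trick is safe for "123" and unsafe for "0123": both map to the same value, so the value alone cannot say which characters were originally present.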

The problem is that the lexical forms for base64Binary, as proposed in the 
schema erratum, allow for variability in whitespace in the lexical form. 
So, if you just send the 'value', you can't be sure whether or not the 
original characters had whitespace embedded or not, as the same value 
corresponds to more than one lexical form.
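
The same point for base64Binary can be sketched in Python. The helper name value_of and the sample strings are made up for illustration; whitespace is stripped explicitly before decoding so the sketch does not rely on any decoder leniency.

```python
import base64


def value_of(lexical: str) -> bytes:
    """Map a base64 lexical form to its value (the octets),
    ignoring any whitespace embedded in the characters."""
    compact = "".join(lexical.split())   # drop spaces, tabs, newlines
    return base64.b64decode(compact)


without_ws = "SGVsbG8sIFdvcmxkIQ=="
with_ws = "SGVs bG8s IFdv\ncmxk IQ=="

# Two different character sequences, one value.
print(without_ws == with_ws)                      # False
print(value_of(without_ws) == value_of(with_ws))  # True
```

A receiver handed only the octets has no way to tell which of the two character sequences the sender started from.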

The rules of the SOAP Recommendation apply before you even consider use of 
XML c14n and/or DSig:  they state that any legal SOAP binding must 
faithfully transmit the infoset, which means leading zeros if present for 
integers, whitespace in base64Binary, etc.  Indeed, the Infoset and thus 
SOAP envelopes are not type aware:  at the level of SOAP envelopes there 
is no such thing as an integer, just character sequences.  I therefore 
believe that the MTOM "trick" can be applied only to one lexical form for 
each base64Binary value, and I have suggested that it be the form called 
out as "canonical" in the erratum to the schema datatypes specification. 
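
As a rough sketch of the check this implies, the Python below treats a character sequence as eligible for optimization only if re-encoding its value reproduces it exactly. The function name is hypothetical, and the sketch assumes that the canonical form is the whitespace-free encoding that Python's base64.b64encode emits; the erratum itself is the authority on the exact definition.

```python
import base64
import binascii


def is_canonical_base64(lexical: str) -> bool:
    """Return True if decoding and re-encoding reproduces the
    exact characters, i.e. value and lexical form are 1-to-1."""
    try:
        # validate=True rejects whitespace and other non-alphabet characters
        value = base64.b64decode(lexical, validate=True)
    except (binascii.Error, ValueError):
        return False
    # Assumption: b64encode output matches the canonical lexical form
    return base64.b64encode(value).decode("ascii") == lexical


print(is_canonical_base64("SGVsbG8sIFdvcmxkIQ=="))     # eligible: no whitespace
print(is_canonical_base64("SGVs bG8s IFdvcmxkIQ=="))   # not eligible: embedded spaces
```

Anything that fails the check would have to travel as ordinary character content so that the infoset is transmitted faithfully.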

This is a different business than the particular c14n Recs that have been 
built to aid DSig, I think.  While it would be plausible to invent new 
ones that were datatype-aware and that, for example, stripped leading 
zeros on integers and put base64Binary in canonical form, I don't believe 
the current c14n Rec does that.  Whether it should is a separate 
discussion, and not something on which I (or anyone else in this 
discussion, as far as I can tell) have offered a recommendation.  FWIW, I 
think we should always tread slowly when considering making XML type 
aware.  MTOM does it purely for purposes of optimization.  Query and 
schema do it for reasons that I think are important (e.g. so I can talk 
about all the age attributes that have a value>50...you presumably want 
to do such comparisons numerically).  SOAP has carefully stayed away from 
anything that normatively depends on schema validation, and even the 
encodings in SOAP 1.2 only assign type names, not value spaces and 
semantics.  The only reason I can see for doing type-aware c14n for dsig 
is if it proves valuable for user applications, or perhaps in conjunction 
with XML Query.  Certainly nothing in this discussion was meant to relate 
directly to the c14n Rec or to dsig.  It's merely been to decide which 
lexical forms are subject to MTOM optimization.  Thanks!
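
For the age>50 example, a small Python sketch shows why the comparison has to be type aware. The function names and sample attribute values are invented for illustration.

```python
def over_50_as_strings(ages):
    """Type-unaware: compare the character sequences themselves."""
    return [a for a in ages if a > "50"]


def over_50_as_integers(ages):
    """Type-aware: compare in the integer value space."""
    return [a for a in ages if int(a) > 50]


ages = ["9", "051", "50", "60"]
print(over_50_as_strings(ages))    # ['9', '60']
print(over_50_as_integers(ages))   # ['051', '60']
```

The character comparison wrongly keeps "9" and drops "051", which is the kind of result a query over age attributes needs to avoid.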

------------------------------------------------------------------
Noah Mendelsohn                              Voice: 1-617-693-4036
IBM Corporation                                Fax: 1-617-693-8676
One Rogers Street
Cambridge, MA 02142
------------------------------------------------------------------

Rich Salz <rsalz@datapower.com>
10/10/03 10:00 AM

 
        To:     Elliotte Rusty Harold <elharo@metalab.unc.edu>
        cc:     Noah Mendelsohn/Cambridge/IBM@Lotus, "xml-dist-app@w3.org" <xml-dist-app@w3.org>
        Subject:        Re: New XMLP Issue Relating to Canonical Forms


> XML canonicalization does not perform Unicode normalization on text,

No, but it will add whitespace (a newline) if there are PI or comment
nodes before or after the first element node.

                 /r$

--
Rich Salz                  Chief Security Architect
DataPower Technology       http://www.datapower.com
XS40 XML Security Gateway  http://www.datapower.com/products/xs40.html
XML Security Overview      http://www.datapower.com/xmldev/xmlsecurity.html
Received on Friday, 10 October 2003 10:40:52 UTC