RE: New XMLP Issue Relating to Canonical Forms from Martin Gudgin on 2003-10-10 (xml-dist-app@w3.org from October 2003)

From: Martin Gudgin <mgudgin@microsoft.com>
Date: Fri, 10 Oct 2003 08:00:24 -0700
To: <noah_mendelsohn@us.ibm.com>, <rsalz@datapower.com>
Cc: "Elliotte Rusty Harold" <elharo@metalab.unc.edu>, <xml-dist-app@w3.org>
Message-ID: <7C083876C492EB4BAAF6B3AE0732970E0CE58277@red-msg-08.redmond.corp.microsoft.com>
Noah,

Isn't it the case that in MTOM, assuming you actually started with the
binary ( which is the reality in most cases ), there is no way to
tell what, if any, whitespace was present in the base64 characters,
because you never had them? And I doubt, outside of using XMLDSIG,
whether anyone will care if it gets surfaced as a stream of characters
with no whitespace or a stream of characters with a newline every 76
chars ( or whatever the schema canonical form is ).
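
A minimal sketch of that point, in Python. The payload and the wrap76
helper are made up for illustration; the 76-character width is just the
figure mentioned above, not a claim about what the schema canonical
form is. Starting from the octets, either character stream is an
equally valid rendering:

```python
import base64


def wrap76(b64_text: str) -> str:
    """Break a base64 string into lines of at most 76 characters."""
    return "\n".join(b64_text[i:i + 76] for i in range(0, len(b64_text), 76))


# Hypothetical payload standing in for the binary we started with.
payload = bytes(range(90))

flat = base64.b64encode(payload).decode("ascii")  # no whitespace at all
wrapped = wrap76(flat)                            # a newline every 76 chars

# The two character streams differ ...
print(flat == wrapped)
# ... but both decode to the same octets, because b64decode skips
# characters outside the base64 alphabet (here, the newline) by default.
print(base64.b64decode(flat) == base64.b64decode(wrapped) == payload)
```

Nothing in the octets says which of the two forms "was" the original.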

Now, when we are using XMLDSIG it becomes pretty critical, because the
current C14N algorithms will leave in new lines in base64 character
streams. So this:

rfch2KY+/eDKm7qW+W6RD5NQFhawtNBLJGmxQJMej8TXzefe3p9PPys1SeA/BH5vdrAzsMUW
t/gn
1rQ8jL1NrAv1vyuHszHzAdfqemvvCx/rQsni6t//wBbutwfqErJu6J3USg==

will not be changed by a C14N algorithm. Nor will this:

/IyEIgAoPnWhN1iVJ/7mlxeAQfiaNZhEDl3mXbYY9di+hbrGKL6W3MPXkXj8e9HZLG8=

So, if we're using existing C14N algorithms ( which seems likely ) we
definitely need to pick a form for the data to be surfaced in, if anyone
does ever ask for the text, just so that signatures will not get broken.
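
To make the breakage concrete, here is a rough sketch in Python. It
assumes Python 3.8 or later, whose xml.etree.ElementTree.canonicalize
implements C14N 2.0 rather than the algorithms current at the time of
this thread, but it likewise leaves newlines inside text content
alone. The Data element name, the tiny base64 value and the choice of
SHA-1 are illustrative assumptions only:

```python
import hashlib
import xml.etree.ElementTree as ET

# Two lexical forms of the same base64 value (the octets "ABCDEFGH").
flat = "QUJDREVGR0g="
wrapped = "QUJD\nREVGR0g="


def digest_of(b64_text: str) -> str:
    """Canonicalize a hypothetical <Data> element and hash the result."""
    doc = "<Data>" + b64_text + "</Data>"
    canonical = ET.canonicalize(doc)  # C14N 2.0; text whitespace is kept
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


# The newline survives canonicalization, so the digests differ even
# though both forms carry the same binary value.
print(digest_of(flat) == digest_of(wrapped))
```

A signature computed over one form therefore will not verify against
the other.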

If we were using new C14N algorithms that were MTOM aware, we could
dispense with the base64 chars altogether, although that would require
the algorithm to emit a byte stream ( rather than an XPath node set ).
Alternatively we could define a transform that converts the base64
content of optimized elements into some known form.
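
A sketch of what such a transform might look like, in Python. The
"known form" chosen here (all whitespace stripped) is an assumption
for illustration, and to_known_form is a hypothetical name, not
something defined by any spec:

```python
import base64
import re


def to_known_form(b64_text: str) -> str:
    """Strip XML whitespace so every rendering of a value looks the same.

    The whitespace-free target is an illustrative assumption only.
    """
    return re.sub(r"[ \t\r\n]+", "", b64_text)


flat = "QUJDREVGR0g="
wrapped = "QUJD\nREVGR0g="

# After the transform both renderings are the same character stream,
# so anything computed over the text would agree for the two forms.
print(to_known_form(flat) == to_known_form(wrapped))
# The underlying octets are unchanged by the transform.
print(base64.b64decode(to_known_form(wrapped)) == base64.b64decode(wrapped))
```

Applied before digesting, a transform like this would make the result
independent of how the characters happened to be surfaced.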

So I think it is somewhat related to XMLDSIG/C14N because it's those
technologies which will care about the whitespace ( or lack thereof ) in
the base64 character streams. In general applications will NOT care (
because they'll just be asking for the raw bits anyway ).

Gudge

> -----Original Message-----
> From: xml-dist-app-request@w3.org 
> [mailto:xml-dist-app-request@w3.org] On Behalf Of 
> noah_mendelsohn@us.ibm.com
> Sent: 10 October 2003 15:36
> To: rsalz@datapower.com
> Cc: Elliotte Rusty Harold; xml-dist-app@w3.org
> Subject: Re: New XMLP Issue Relating to Canonical Forms
> 
> 
> Sorry, but I think there's been some confusion here.  The 
> current discussion bears no immediate relation to XML c14n, 
> DSig, etc.  It's actually more fundamental to any use of SOAP 
> with MTOM, independent of whether XML DSig or the associated 
> c14n Recs are to be used.  In brief: we've been referring to
> canonical forms of schema datatypes, as defined in the datatypes
> recommendation, as opposed to the term canonical as introduced
> by the c14n recs that are used in conjunction with DSig.  The
> following explains in more detail.
> 
> The trick in MTOM is basically to say that for data known to 
> be in a lexical form corresponding to xsd:base64Binary, 
> sending the value (in the sense of XML schema value space) is 
> sufficient to reconstruct the lexical form.  This would be 
> like saying for integers that you can reconstruct the three 
> character sequence '1' '2' '3' by sending the value that in 
> java would be int i = 123.  The point is that, in the case of 
> integers, that's true only if you know that the integer has 
> no leading zero (or that it invariably has one leading zero, 
> or whatever.)  In short, if the lexical and value forms are 
> exactly 1-to-1, then this trick works. 
> 
> The problem is that the lexical forms for base64Binary, as 
> proposed in the schema erratum, allow for variability in 
> whitespace in the lexical form.  So, if you just send the
> 'value', you can't be sure whether or not the original
> characters had whitespace embedded, as the same value
> corresponds to more than one lexical form.
> 
> The rules of the SOAP Recommendation apply before you even 
> consider use of XML c14n and/or DSig:  they state that any 
> legal SOAP binding must faithfully transmit the infoset, 
> which means leading zeros if present for integers, whitespace
> in base64Binary, etc.  Indeed, the Infoset and thus SOAP
> envelopes are not type aware:  at the level of SOAP envelopes
> there is no such thing as an integer, just character sequences.
> I therefore believe that the MTOM "trick" can be applied only
> to one lexical form for each base64Binary value, and I have 
> suggested that it be the form called out as "canonical" in 
> the erratum to the schema datatypes specification. 
> 
> This is a different business than the particular c14n Recs  
> that have been built to aid DSig, I think.  While it would be 
> plausible to invent new ones that were datatype-aware and 
> that, for example, stripped leading zeros on integers and put 
> base64Binary in canonical forms, I don't believe the current
> c14n rec does that.  Whether it should is a separate 
> discussion, and not something on which I (or anyone else in 
> this discussion as far as I can tell) has offered a 
> recommendation.  FWIW, I think we should always tread slowly 
> when considering making XML type aware.  MTOM does it purely 
> for purposes of optimization.  Query and schema do it for 
> reasons that I think are important (e.g. so I can talk about 
> all the age attributes that have a value > 50; you presumably 
> want to do such comparisons numerically).  SOAP has carefully 
> stayed away from anything that normatively depends on schema 
> validation, and even the encodings on SOAP 1.2 only assign 
> type names, not value spaces and semantics.  The only reason 
> I can see for doing type-aware c14n for dsig is if it proves 
> valuable for user applications, or perhaps in conjunction 
> with XML Query.  Certainly nothing in this discussion was 
> meant to relate directly to the c14n Rec or to dsig.  It's 
> merely been to decide which lexical forms are subject to MTOM 
> optimization.  Thanks!
> 
> ------------------------------------------------------------------
> Noah Mendelsohn                              Voice: 1-617-693-4036
> IBM Corporation                                Fax: 1-617-693-8676
> One Rogers Street
> Cambridge, MA 02142
> ------------------------------------------------------------------
> 
> 
> Rich Salz <rsalz@datapower.com>
> 10/10/03 10:00 AM
> 
>  
>         To:     Elliotte Rusty Harold <elharo@metalab.unc.edu>
>         cc:     Noah Mendelsohn/Cambridge/IBM@Lotus, 
>                 "xml-dist-app@w3.org" <xml-dist-app@w3.org>
>         Subject:        Re: New XMLP Issue Relating to Canonical Forms
> 
> 
> > XML canonicalization does not perform Unicode normalization on text,
> 
> No, but it will add whitespace (a newline) if there are PI or 
> comment nodes before or after the first element node.
> 
>                  /r$
> 
> --
> Rich Salz                  Chief Security Architect
> DataPower Technology       http://www.datapower.com
> XS40 XML Security Gateway  http://www.datapower.com/products/xs40.html
> XML Security Overview      http://www.datapower.com/xmldev/xmlsecurity.html
> 
Received on Friday, 10 October 2003 11:00:13 UTC