RE: New XMLP Issue Relating to Canonical Forms from noah_mendelsohn@us.ibm.com on 2003-10-10 (xml-dist-app@w3.org from October 2003)

From: <noah_mendelsohn@us.ibm.com>
Date: Fri, 10 Oct 2003 12:01:25 -0400
To: mgudgin@microsoft.com
Cc: "Elliotte Rusty Harold" <elharo@metalab.unc.edu>, rsalz@datapower.com, xml-dist-app@w3.org
Message-ID: <OFB2F1A5F5.CB847CDA-ON85256DBB.0056A1B0@lotus.com>
Martin Gudgin writes:

> Isn't it the case that in MTOM, assuming you actually
> started with the binary ( which is the reality in most
> cases ), then there is no way to tell what, if any,
> whitespace was present in the base64 characters,
> because you didn't have them.

No, I don't think this is true.  Even in the case of MTOM with a binary 
source, SOAP is Infoset and thus characters.  I think we need to 
distinguish what you must know in principle from what you must burn cycles 
computing if nobody actually needs to see it.

SOAP says you must have an Infoset, which means that if asked you must 
know what characters are in the Infoset, and a receiver must be capable of 
reproducing those characters.  For example, as I think you say in your 
note, an MTOM message might leave an MTOM sender and be signed using the 
already published exclusive c14n (which is type-unaware).  That's a case 
where you will actually have to take the trouble to compute the characters 
(or at least to compute a signature that will be the same as if you had gone 
through the character form).  The c14n result and the signature will 
differ according to whether you decide that the lexical form had 
whitespace between pieces of the binary representation or not. 
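
To make the point about signatures concrete, here is a minimal Python 
sketch.  It is only an illustration: the payload is made up, SHA-1 over the 
raw characters stands in for a real c14n-plus-dsig pipeline, and I am 
assuming the canonical form is the one with no whitespace at all. 

```python
import base64
import hashlib

# Hypothetical binary payload; any octets would do.
payload = bytes(range(60))

# Lexical form with no whitespace (assumed here to be the canonical form).
canonical = base64.b64encode(payload).decode("ascii")

# Same octets, but with MIME-style line breaks between pieces of the base64.
wrapped = base64.encodebytes(payload).decode("ascii")

# Stand-in for "canonicalize, then digest": hash the characters as they are.
digest_canonical = hashlib.sha1(canonical.encode("ascii")).hexdigest()
digest_wrapped = hashlib.sha1(wrapped.encode("ascii")).hexdigest()

# Both lexical forms decode to the same binary value...
same_value = base64.b64decode(canonical) == base64.b64decode(wrapped)

# ...but a type-unaware digest over the characters differs.
same_digest = digest_canonical == digest_wrapped

print(same_value, same_digest)
```

The binary value is identical in both cases, yet the digests over the 
characters are not, which is why the character form has to be pinned down. 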

Furthermore, if that message goes through a second non-MTOM hop, you will 
have to convert to actual characters for transmission in, e.g., the SOAP 
1.2 HTTP binding.  Presumably you will check the dsig based on the actual 
characters transmitted, so you had better specify precisely what they are. 

For all these reasons, I believe we must always be able to say what the 
characters are in the Envelope infoset, even if the source was binary in 
the implementation.  Then again, I completely agree that there are many 
interesting use cases in which these characters will not be explicitly 
computed.  The simple case where a value starts in binary, is sent through 
MTOM, and is processed by a receiver using a binary API is such an 
example, and an important one.

That suggests why we need to know the character form of optimized values. 
I think the use cases that prove that there must be only one such form are 
in some sense the converse.  Let's say that some sender, for whatever 
reason, has an element containing characters, and has reason to know that 
those characters are not in what the Schema erratum calls canonical form, 
i.e. the whitespace is not where the canonical form says it should be.  My 
claim is that you MUST NOT MTOM-encode such an element, because when it is 
received, or relayed through a non-MTOM binding, its characters will not be 
reconstructed correctly.  That's the 
crucial use case, and I think it's important.  I really don't want to 
ignore the SOAP Rec's fundamental requirement that bindings be capable of 
reconstructing the Infoset.
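
The rule can be sketched as a round-trip check in Python.  The helper name 
can_optimize is hypothetical, and I am again assuming that the canonical 
form is exactly what a plain base64 encoder emits with no whitespace: 

```python
import base64
import binascii


def can_optimize(chars: str) -> bool:
    """Return True only if decoding and re-encoding reproduces `chars` exactly.

    Hypothetical helper: a sender would call this before deciding to
    MTOM-encode an element's character content.
    """
    try:
        # Lenient decode: whitespace in the input is skipped, not rejected.
        octets = base64.b64decode(chars)
    except (binascii.Error, ValueError):
        # Not valid base64 at all, so it cannot be optimized.
        return False
    # A receiver holding only the octets can regenerate just this form.
    return base64.b64encode(octets).decode("ascii") == chars


print(can_optimize("QUJD"))    # no whitespace
print(can_optimize("QU JD"))   # embedded space
print(can_optimize("QUJD\n"))  # trailing newline
```

If the check fails, the receiver could not rebuild the original characters 
from the binary alone, so the element has to travel as characters. 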

> IF we were using new C14N algorithms that were MTOM
> aware, we could dispense with the base64 chars
> altogether, although that would require the algorithm
> to emit a byte stream ( rather than an Xpath node set
> ).  Alternatively we could define a transform that
> converts the base64 content of optimized elements into
> some known form.

Agreed, that deals with DSIG, and doing a type-aware (or at least MTOM-aware) 
c14n may be a good idea.  I don't think, however, that it deals directly with 
the case where you are going through a second hop using a non-MTOM binding 
(it handles the signature, but nothing else), and I don't think it 
eliminates the need to faithfully transmit non-canonical forms if the 
application has explicitly provided them.  Those are the primary reasons 
I think that MTOM has to be viewed as "canonical lexical 
representation" only.  Thanks!

------------------------------------------------------------------
Noah Mendelsohn                              Voice: 1-617-693-4036
IBM Corporation                                Fax: 1-617-693-8676
One Rogers Street
Cambridge, MA 02142
------------------------------------------------------------------
Received on Friday, 10 October 2003 12:04:08 UTC