Re: Re-posting of "what is minimal caonicalization" notes from Donald E. Eastlake 3rd on 1999-10-11 (w3c-ietf-xmldsig@w3.org from October to December 1999)

From: Donald E. Eastlake 3rd <dee3@torque.pothole.com>
Date: Mon, 11 Oct 1999 12:59:23 -0400
To: Mark Bartel <mbartel@thistle.ca>
cc: "''w3c-ietf-xmldsig@w3.org' '" <w3c-ietf-xmldsig@w3.org>
Message-Id: <199910111659.MAA00822@torque.pothole.com>
Hi,

From:  Mark Bartel <mbartel@thistle.ca>
Resent-Date:  Thu, 7 Oct 1999 16:08:33 -0400 (EDT)
Resent-Message-Id:  <199910072008.QAA09833@www19.w3.org>
Message-ID:  <91F20911A6C0D2118DF80040056D77A20329E9@arren.cpu1634.adsl.bellglobal.com>
To:  "'Ed Simon '" <ed.simon@entrust.com>,
            "''w3c-ietf-xmldsig@w3.org' '"
    	 <w3c-ietf-xmldsig@w3.org>
Date:  Thu, 7 Oct 1999 16:08:20 -0400 

>In the last draft [1], I had minimal canonicalization doing three things:
>* normalizing the character encoding to UTF-8, removing the encoding
>pseudo-attribute
>* normalizing line endings
>* normalizing white space
>
>I wasn't certain about the last item (white space normalization), and I'm
>happy to see it dropped as per the teleconference.

(The proposal was to normalize white space between a close angle
bracket and a following open angle bracket, where they are separated
only by white space.  Thus it would not actually have required XML
knowledge or parsing but could have been based just on simple
character sequences.)

>The controversial item was removing the encoding pseudo-attribute.
>Normalizing to UTF-8 is not useful if we keep a reference to the original
>encoding (this is exactly what Don says in his last point below).  Isn't the
>whole point of canonicalization to allow changes that don't affect the
>meaning?  Selecting minimal canonicalization means that as far as the
>application is concerned the character encoding doesn't matter.

I prefaced my previous remarks with a statement that you could go
either way on this question.

In addition to my previous arguments, I would point out that you are
assuming that all changes in the encoding of characters are XML aware.
Depending on the channels and storage XML signed data goes through, it
might already be in an encoding different from the prolog attribute if
the prolog attribute is even present.

>I think part of the resistance to this is that it makes the minimal
>canonicalization slightly XML aware.  But how will the minimal
>canonicalization determine the character encoding to convert from if the
>resource isn't XML?  An arbitrary text resource may have no indication of
>how it is encoded.

Many things signed will not be XML documents, either because they are
binary data or because they are just a hunk of XML rather than a
well-formed document.  I think that the encoding will have to be
supplied as a parameter with the data in the general case.
Transformation canonicalizations are steps in a pipeline.  The way to
look at this is to ask what you get from the previous stage and hand
to the next stage.  Seems to me that it must be an octet string and an
encoding.  Otherwise, encoding transformations would have a hard time
telling what to do.  So minimal canonicalization, whether it's the
first, a middle, or the last Transformation, could expect to be handed
a string of bytes and the encoding they are in.  Thus it is
substantial complexity and additional work for it to parse the
material, which might not even be balanced or XML.  By ignoring the
possible presence of an encoding pseudo-attribute, you get a much
simpler canonicalization (that happens to be quite useful for text
also).
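
As a rough illustration of the "octet string plus encoding" stage
described above, here is a minimal sketch in Python.  The function name
minimal_canonicalize and the exact line-ending rule (CRLF and lone CR
both become LF) are assumptions made for the example, not anything taken
from the draft.

```python
from typing import Tuple


def minimal_canonicalize(octets: bytes, encoding: str) -> Tuple[bytes, str]:
    """Hypothetical pipeline stage: (octets, encoding) in, (octets, encoding) out."""
    # The encoding arrives as a parameter; nothing is sniffed from the data.
    text = octets.decode(encoding)
    # Assumed line-ending rule: CRLF and lone CR both become LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # No XML parsing happens here, so any encoding pseudo-attribute in a
    # prolog is carried through as ordinary characters.
    return text.encode("utf-8"), "utf-8"


source = '<?xml version="1.0" encoding="ISO-8859-1"?>\r\n<doc>caf\u00e9</doc>\r\n'
out, out_encoding = minimal_canonicalize(source.encode("iso-8859-1"), "iso-8859-1")
```

Note that the prolog still says ISO-8859-1 even though the octets
handed to the next stage are UTF-8; the stage never looked at it.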

>Basically, I think we MUST remove the encoding pseudo-attribute if the UTF-8
>conversion is to be useful.

It's true that leaving it in means that under some circumstances
people will have to explicitly ignore it or perhaps even the entire
prolog if that is what they want.  Of course, there is no additional
effort to do this for a particular signed item if they are already
extracting / filtering-to what they are signing or not dealing with a
full document so there is no prolog anyway.  If they are signing a
form, wouldn't it be a common case to ignore all preceding and
following comments and the prolog, for example?

>-Mark Bartel
>JetForm
>
>[1] http://www.w3.org/Signature/Drafts/xmldsig-core-991001.html

Thanks,
Donald

>-----Original Message-----
>From: Ed Simon
>To: 'w3c-ietf-xmldsig@w3.org'
>Sent: 10/7/99 1:45 PM
>Subject: Re-posting of "what is minimal caonicalization" notes
>
>During today's t12e (<- my new abbreviation for "teleconference"), there
>were issues raised about what "minimal canonicalization" entails.  I
>mentioned that there were a couple of appends on those subjects posted
>some time ago and I would repost them. Here they are.
>The text of my original note is indicated by leading ">"s, Don's
>response is flush with the margin.
>Ed
>------------------------------
>Message-Id: <199909141846.OAA08123@torque.pothole.com>
>To: <w3c-ietf-xmldsig@w3.org>
>In-reply-to: Your message of "Thu, 09 Sep 1999 15:05:25 EDT."
> <01E1D01C12D7D211AFC70090273D20B105E74B@sothmxs06.entrust.com>
>Date: Tue, 14 Sep 1999 14:46:31 -0400
>From: "Donald E. Eastlake 3rd" <dee3@torque.pothole.com>
>Subject: Re: Character canonicalization and XML encoding declarations 
>
>
>From:  Ed Simon <ed.simon@entrust.com>
>Resent-Date:  Thu, 9 Sep 1999 15:10:32 -0400 (EDT)
>Resent-Message-Id:  <199909091910.PAA23533@www19.w3.org>
>Message-ID:  <01E1D01C12D7D211AFC70090273D20B105E74B@sothmxs06.entrust.com>
>To:  "'w3c-ietf-xmldsig@w3.org'" <w3c-ietf-xmldsig@w3.org>
>Date:  Thu, 9 Sep 1999 15:05:25 -0400 
>
>>Intro:
>>The prolog of an XML instance contains an (optional) encoding
>>declaration that specifies the XML instance's character encoding.
>>For example,
>>
>>   <?xml version="1.0" encoding="ISO-8859-1"?>
>>   <doc>Hello World</doc>
>>
>>XML documents that do not have an encoding declaration must be encoded
>>as either UTF-8 or UTF-16, according to section 4.3.3 of
>>"<http://www.w3.org/TR/1998/REC-xml-19980210>".  A parser can tell the
>>difference by checking if there are any Byte Order Marks in the
>>instance.  (This seems to imply that to be sure of the character
>>encoding, the parser must go through the entire XML instance; might
>>this be problematic in some applications?)
>
>Actually, the Byte Order Mark is supposed to be the first character
>and is part of the encoding, not part of the document...
>
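
To illustrate the point that only the leading octets need to be
examined, here is a small Python sketch.  The helper name sniff_bom is
made up for the example, and the table of marks is just what Python's
codecs module provides, not a statement of what any XML processor is
required to recognize.

```python
import codecs

# Longer marks come first: the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(octets):
    """Return (encoding name or None, octets with any leading mark removed)."""
    for bom, name in _BOMS:
        # Only the start of the stream is inspected, never the whole instance.
        if octets.startswith(bom):
            return name, octets[len(bom):]
    return None, octets


body = "<doc>Hello World</doc>".encode("utf-16-be")
detected, rest = sniff_bom(codecs.BOM_UTF16_BE + body)
```
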
>>Here's a couple of problem scenarios...
>>
>>Problem scenario 1:
>>
>>Suppose one wants to canonicalize the character encoding to UTF-8 for
>>the following XML instance:
>>
>>
>>   <?xml version="1.0" encoding="ISO-8859-1"?>
>>   <doc>Hello World</doc>
>>
>>Do we agree the result should be
>>
>>   <?xml version="1.0" encoding="UTF-8"?>
>>   <doc>Hello World</doc>
>>
>>   (note change is to the encoding declaration)
>>
>>OR
>>
>>   <?xml version="1.0" encoding="ISO-8859-1"?>
>>   <doc>Hello World</doc>
>>
>>   (note no change to encoding declaration)
>>
>>
>>I vote for changing the encoding declaration.  Everyone agree?
>
>Not really...  You could go either way here...  I would think that in
>many applications you would not want the signature to cover the prolog
>encoding attribute so it should be omitted either by leaving it out,
>which is fine for UTF-8 and UTF-16 (the only two encodings support for
>which is mandatory), or by using canonicalizations/transformations
>that omit it.  But that kind of begs the question since we are
>defining a canonicalization.  If you did do a signature covering it,
>it must be important, likely the document is stored/transmitted in the
>specified encoding so you need the encoding to understand it, and it
>should not be changed.  It may seem odd that the signature and
>verification processes would re-encode 'encoding="ISO-8859-1"' into
>UTF-8 before hashing but that's the only way to actually sign the
>original encoding attribute.  Maybe there is some case where changing
>this attribute would change the meaning for an application, which you
>could do if it were not signed but rather smashed to UTF-8 and the
>'encoding="UTF-8"' signed.  Because you could go either way, in my
>mind the deciding factor is simplicity.  For a minimal
>canonicalization that does not understand XML syntax it seems like too
>much of a pain to figure out if you have a prolog and actually change
>this attribute.
>
>>Problem scenario 2:
>>
>>Suppose one has the following XML instance:
>>
>>   <?xml version="1.0" encoding="UCS-4"?>
>>   <doc>
>>
>>   <stuff-that-maps-directly-to-utf8>
>>   Nothin but US-ASCII with codes less than 128.
>>   </stuff-that-maps-directly-to-utf8>
>>
>>   <stuff-that-requires-more-complicated-conversion-to-utf8>
>>   {Assume there are UCS-2, UCS-4,
>>   or other multi-byte character encoding here.}
>>   </stuff-that-requires-more-complicated-conversion-to-utf8>
>>
>>   </doc>
>>
>>And suppose we want to sign one of <doc>'s two child elements.  Do we
>>require that the extraction mechanism indicate the character encoding
>>of the content it is giving us?  I say yes, because we need to know
>>how to convert the content to UTF-8.
>
>I agree with you here but I don't see how a system could ever avoid
>indicating the character encoding.  Once it is extracted, some XML no
>longer has the initial <? and stuff by which we can figure out its
>encoding and character set.  A sequence of bytes in memory can be
>ambiguous.  You would want explicit indication of the encoding
>accompanying the extract for any use I can think of.
>
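
A short Python illustration of why a bare sequence of bytes is
ambiguous: the same extracted octets decode to different characters,
and therefore to different UTF-8 output, depending on which encoding
the extractor says they are in.  The fragment is invented for the
example.

```python
# An extracted element with no prolog to say how it is encoded.
fragment = b"<item>caf\xc3\xa9</item>"

# Read as UTF-8, the two high octets are one character (e with acute accent).
as_utf8 = fragment.decode("utf-8")

# Read as ISO-8859-1, the same two octets are two separate characters.
as_latin1 = fragment.decode("iso-8859-1")

# Normalizing each reading to UTF-8 yields different octets to hash.
canonical_from_utf8 = as_utf8.encode("utf-8")
canonical_from_latin1 = as_latin1.encode("utf-8")
```
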
>>One might ask whether, for security (not technical) reasons,
>>the character encoding of the original XML instance needs to be
>>signed.  If a signature does not capture the original character
>>encoding, and if one cannot unambiguously determine the character
>>encoding from solely the resultant UTF-8, then does the meaning
>>of what was signed become ambiguous?  I say no, because the
>>meaning of the UTF-8 is not ambiguous.  Does everyone agree?
>>Does this mean there is no requirement to capture the original
>>character encoding in the signature?
>
>If you are defining a canonicalization that normalizes encoding to
>UTF-8, it should not introduce any explicit indication of the original
>character encoding.  Just as if you are defining a canonicalization
>that normalizes line endings, it should not introduce an explicit
>indication of what the un-canonicalized line-ends were.  If you make
>explicit note in the output of everything you are canonicalizing, you
>might as well have used the null canonicalization.
>
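
A sketch of what that means in practice, in Python: the same characters
arriving in two different encodings normalize to identical UTF-8 octets
and so to the same digest, with nothing in the output recording where
they came from.  The helper to_utf8 and the choice of SHA-1 are
assumptions for the example only.

```python
import hashlib


def to_utf8(octets, encoding):
    """Hypothetical encoding normalization: re-encode as UTF-8, record nothing."""
    return octets.decode(encoding).encode("utf-8")


text = "<doc>caf\u00e9</doc>"
from_latin1 = to_utf8(text.encode("iso-8859-1"), "iso-8859-1")
from_utf16 = to_utf8(text.encode("utf-16-be"), "utf-16-be")

# Both inputs hash identically once normalized.
same_digest = (
    hashlib.sha1(from_latin1).hexdigest() == hashlib.sha1(from_utf16).hexdigest()
)
```
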
>>Regards, Ed
>
>Thanks,
>Donald
>
Received on Monday, 11 October 1999 12:59:38 UTC