- From: Mark Bartel <mbartel@thistle.ca>
- Date: Thu, 7 Oct 1999 16:08:20 -0400
- To: "'Ed Simon '" <ed.simon@entrust.com>, "''w3c-ietf-xmldsig@w3.org' '" <w3c-ietf-xmldsig@w3.org>
In the last draft [1], I had minimal canonicalization doing three things:

* normalizing the character encoding to UTF-8, removing the encoding pseudo-attribute
* normalizing line endings
* normalizing white space

I wasn't certain about the last item (white space normalization), and I'm happy to see it dropped as per the teleconference.

The controversial item was removing the encoding pseudo-attribute. Normalizing to UTF-8 is not useful if we keep a reference to the original encoding (this is exactly what Don says in his last point below). Isn't the whole point of canonicalization to allow changes that don't affect the meaning? Selecting minimal canonicalization means that, as far as the application is concerned, the character encoding doesn't matter.

I think part of the resistance to this is that it makes minimal canonicalization slightly XML-aware. But how would minimal canonicalization determine the character encoding to convert from if the resource isn't XML? An arbitrary text resource may carry no indication of how it is encoded.

Basically, I think we MUST remove the encoding pseudo-attribute if the UTF-8 conversion is to be useful.

-Mark Bartel
JetForm

[1] http://www.w3.org/Signature/Drafts/xmldsig-core-991001.html

-----Original Message-----
From: Ed Simon
To: 'w3c-ietf-xmldsig@w3.org'
Sent: 10/7/99 1:45 PM
Subject: Re-posting of "what is minimal canonicalization" notes

During today's t12e (<- my new abbreviation for "teleconference"), there were issues raised about what "minimal canonicalization" entails. I mentioned that a couple of appends on those subjects had been posted some time ago and that I would repost them. Here they are. The text of my original note is indicated by leading ">"s; Don's responses are flush with the margin.

Ed

------------------------------

Message-Id: <199909141846.OAA08123@torque.pothole.com>
To: <w3c-ietf-xmldsig@w3.org>
In-reply-to: Your message of "Thu, 09 Sep 1999 15:05:25 EDT." <01E1D01C12D7D211AFC70090273D20B105E74B@sothmxs06.entrust.com>
Date: Tue, 14 Sep 1999 14:46:31 -0400
From: "Donald E. Eastlake 3rd" <dee3@torque.pothole.com>
Subject: Re: Character canonicalization and XML encoding declarations

From: Ed Simon <ed.simon@entrust.com>
Resent-Date: Thu, 9 Sep 1999 15:10:32 -0400 (EDT)
Resent-Message-Id: <199909091910.PAA23533@www19.w3.org>
Message-ID: <01E1D01C12D7D211AFC70090273D20B105E74B@sothmxs06.entrust.com>
To: "'w3c-ietf-xmldsig@w3.org'" <w3c-ietf-xmldsig@w3.org>
Date: Thu, 9 Sep 1999 15:05:25 -0400

>Intro:
>The prolog of an XML instance contains an (optional) encoding declaration
>that specifies the XML instance's character encoding. For example,
>
>    <?xml version="1.0" encoding="ISO-8859-1"?>
>    <doc>Hello World</doc>
>
>XML documents that do not have an encoding declaration must be encoded as
>either UTF-8 or UTF-16, according to section 4.3.3 of
>"<http://www.w3.org/TR/1998/REC-xml-19980210>". A parser can tell the
>difference by checking for a Byte Order Mark in the instance. (This seems
>to imply that, to be sure of the character encoding, the parser must go
>through the entire XML instance; might this be problematic in some
>applications?)

Actually, the Byte Order Mark is supposed to be the first character and is part of the encoding, not part of the document...

>Here are a couple of problem scenarios...
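[As an aside on the Byte Order Mark point: a simplified Python sketch of this kind of detection, for illustration only. The helper name and return values are assumptions, and the REC's actual autodetection considers more byte patterns than this. Note that only the leading bytes of the instance need inspecting, not the whole of it, which answers the parenthetical concern above.]

    import re

    def sniff_xml_encoding(data: bytes) -> str:
        """Guess an XML entity's encoding from its leading bytes (a sketch)."""
        if data.startswith(b'\xef\xbb\xbf'):
            return 'utf-8'        # UTF-8 Byte Order Mark
        if data.startswith(b'\xfe\xff'):
            return 'utf-16-be'    # UTF-16 big-endian Byte Order Mark
        if data.startswith(b'\xff\xfe'):
            return 'utf-16-le'    # UTF-16 little-endian Byte Order Mark
        # No BOM: look for an encoding pseudo-attribute in the declaration.
        end = data.find(b'?>')
        if data.startswith(b'<?xml') and end != -1:
            decl = data[:end].decode('ascii', errors='replace')
            m = re.search(r'encoding\s*=\s*["\']([^"\']+)["\']', decl)
            if m:
                return m.group(1).lower()
        return 'utf-8'            # the default when nothing is declared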
>Problem scenario 1:
>
>Suppose one wants to canonicalize the character encoding to UTF-8 for the
>following XML instance:
>
>    <?xml version="1.0" encoding="ISO-8859-1"?>
>    <doc>Hello World</doc>
>
>Do we agree the result should be
>
>    <?xml version="1.0" encoding="UTF-8"?>
>    <doc>Hello World</doc>
>
>(note the change to the encoding declaration)
>
>OR
>
>    <?xml version="1.0" encoding="ISO-8859-1"?>
>    <doc>Hello World</doc>
>
>(note no change to the encoding declaration)
>
>I vote for changing the encoding declaration. Everyone agree?

Not really... You could go either way here...

I would think that in many applications you would not want the signature to cover the prolog encoding attribute, so it should be omitted, either by leaving it out, which is fine for UTF-8 and UTF-16 (the only two encodings for which support is mandatory), or by using canonicalizations/transformations that omit it. But that kind of begs the question, since we are defining a canonicalization.

If you did sign it, it must be important; likely the document is stored/transmitted in the specified encoding, so you need the encoding to understand it, and it should not be changed. It may seem odd that the signature and verification processes would re-encode 'encoding="ISO-8859-1"' into UTF-8 before hashing, but that's the only way to actually sign the original encoding attribute. Maybe there is some case where changing this attribute would change the meaning for an application, which you could do if it were not signed but rather smashed to UTF-8 and the 'encoding="UTF-8"' signed.

Because you could go either way, in my mind the deciding factor is simplicity. For a minimal canonicalization that does not understand XML syntax, it seems like too much of a pain to figure out whether you have a prolog and actually change this attribute.

>Problem scenario 2:
>
>Suppose one has the following XML instance:
>
>    <?xml version="1.0" encoding="UCS-4"?>
>    <doc>
>
>      <stuff-that-maps-directly-to-utf8>
>        Nothin but US-ASCII with codes less than 128.
>      </stuff-that-maps-directly-to-utf8>
>
>      <stuff-that-requires-more-complicated-conversion-to-utf8>
>        {Assume there are UCS-2, UCS-4, or other multi-byte character
>        encodings here.}
>      </stuff-that-requires-more-complicated-conversion-to-utf8>
>
>    </doc>
>
>And suppose we want to sign one of <doc>'s two child elements. Do we
>require that the extraction mechanism indicate the character encoding of
>the content it is giving us? I say yes, because we need to know how to
>convert the content to UTF-8.

I agree with you here, but I don't see how a system could ever avoid indicating the character encoding. Once it is extracted, some XML no longer has the initial <? and so forth by which we can figure out its encoding and character set. A sequence of bytes in memory can be ambiguous. You would want an explicit indication of the encoding accompanying the extract for any use I can think of.

>One might ask whether, for security (not technical) reasons, the character
>encoding of the original XML instance needs to be signed. If a signature
>does not capture the original character encoding, and if one cannot
>unambiguously determine the character encoding from solely the resultant
>UTF-8, then does the meaning of what was signed become ambiguous? I say no,
>because the meaning of the UTF-8 is not ambiguous. Does everyone agree?
>Does this mean there is no requirement to capture the original character
>encoding in the signature?
If you are defining a canonicalization that normalizes the encoding to UTF-8, it should not introduce any explicit indication of the original character encoding, just as a canonicalization that normalizes line endings should not introduce an explicit indication of what the un-canonicalized line ends were. If you make explicit note in the output of everything you are canonicalizing, you might as well have used the null canonicalization.

>Regards, Ed

Thanks,
Donald
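[To pull the thread together: a rough Python sketch of the minimal canonicalization Mark argues for above, i.e. transcode to UTF-8, drop the encoding pseudo-attribute, and normalize line endings, with white space left alone as per the teleconference. The function name, the regular expression, and the caller-supplied source encoding are illustrative assumptions, not anything the group has agreed on.]

    import re

    def minimal_canonicalize(data: bytes, source_encoding: str) -> bytes:
        # The caller supplies the source encoding; as scenario 2 notes,
        # extracted content may carry no reliable in-band indication of it.
        text = data.decode(source_encoding)
        # Normalize line endings to a single LF.
        text = text.replace('\r\n', '\n').replace('\r', '\n')
        # Drop the encoding pseudo-attribute from the XML declaration, if
        # one is present, so the original encoding leaves no trace.
        text = re.sub(r'(<\?xml[^?]*?)\s+encoding\s*=\s*("[^"]*"|\'[^\']*\')',
                      r'\1', text, count=1)
        # Emit UTF-8 without a Byte Order Mark.
        return text.encode('utf-8')

[Run over both of the scenario 1 candidates, each decoded with its own declared encoding, this produces byte-for-byte identical output, since the pseudo-attribute is gone from each; that is exactly the property being argued for.]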