- From: <dee3@us.ibm.com>
- Date: Mon, 19 Apr 1999 15:57:10 -0400
- To: w3c-xml-sig-ws@w3.org
See comments below preceded by ###.

Donald E. Eastlake, 3rd
17 Skyline Drive, Hawthorne, NY 10532 USA
dee3@us.ibm.com   tel: 1-914-784-7913, fax: 1-914-784-3833
home: 65 Shindegan Hill Road, RR#1, Carmel, NY 10512 USA
dee3@torque.pothole.com   tel: 1-914-276-2668

"Milton M. Anderson" <miltonma@gte.net> on 04/18/99 10:09:11 AM
To: w3c-xml-sig-ws@w3.org
cc: (bcc: Donald Eastlake/Hawthorne/IBM)
Subject: Canonicalization from a Digital Signature Point of View

Based on the discussion about canonicalization, we may need a more precise definition of canonicalization from a digital signature point of view. The following is a preliminary proposal:

Let Xi be an XML document. Canonicalization is a function C such that C(Xi) = M produces a variable-length octet string M suitable for input to a cryptographic hash function.

Given a set of XML documents X1, ..., Xn, canonicalization must have an "equivalence" property:

1. If C(Xi) = C(Xj) for any pair of documents Xi and Xj in the set, then Xi and Xj must have the same legal meaning, business information, and aesthetic value (assuming we wish to have lawyers make contracts, business people communicate data, and authors sign works).

### Well, the legal meaning, what is considered business information, and what the aesthetic value of an XML document is all depend on the application that uses it. Applications that use the DOM just can't "see" certain things in the XML that applications written at a lower level can. Some applications might depend on an exact Unicode match against arbitrary Unicode strings in the application, such that normalizing the Unicode causes them to fail. Others might assume a specific normalized form of Unicode, such that Unicode canonicalization is required. This might seem to imply a separate canonicalization function per application, but I don't think it need be that bad.
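[A minimal sketch of property 1, hypothetical code not from the original note: a toy canonicalizer that sorts attributes and strips insignificant whitespace, so two different serializations of the same content yield the same octet string and therefore the same hash. The function `toy_canonicalize` is an illustrative invention, not any standard XML canonicalization.]

```python
import hashlib
import xml.etree.ElementTree as ET

def toy_canonicalize(xml_text):
    """Toy canonicalization: parse, then re-serialize with sorted
    attributes and stripped inter-element whitespace.
    Illustrative only -- not a standard XML canonicalization."""
    def emit(el):
        attrs = "".join(f' {k}="{v}"' for k, v in sorted(el.attrib.items()))
        inner = (el.text or "").strip() + "".join(emit(c) for c in el)
        return f"<{el.tag}{attrs}>{inner}</{el.tag}>" + (el.tail or "").strip()
    return emit(ET.fromstring(xml_text)).encode("utf-8")

# Two serializations differing only in attribute order and whitespace:
x1 = '<doc b="2" a="1"><item>x</item></doc>'
x2 = '<doc a="1"  b="2">\n  <item>x</item>\n</doc>'

# Property 1: equal canonical forms, hence equal hash inputs.
assert toy_canonicalize(x1) == toy_canonicalize(x2)
digest = hashlib.sha256(toy_canonicalize(x1)).hexdigest()
```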
### I am, however, beginning to think that there may be good reasons for more than one standard canonicalization, and that might not be too much of a burden if some are very simple and/or subsets of other canonicalizations. For example, maybe you have Null, Char, DOM, and DOMuni, as follows:

Null   - does nothing, i.e. hashes the original raw byte stream
Char   - converts to UTF-16 and hashes the bytes in network order
DOM    - DOM HASH with tweaks; includes conversion to UTF-16
DOMuni - DOM, but with Unicode canonicalized to the combined forms

Or perhaps things could be even more orthogonal. If you had four different sequential canonicalizing transformations that you could turn on or off, then having sixteen "different" mandatory-to-implement canonicalizations might not be too much of a burden.

Canonicalization must also have an "exclusion" property:

2. It must be computationally infeasible to find a document Xx, not in the set X1, ..., Xn, such that C(Xx) = C(Xi) and such that Xx has a different legal meaning, business information, or aesthetic value than Xi.

### Same comments as above on property 1.

To be useful for processing XML documents that are communicated among digital signature signers and verifiers, canonicalization must also have a "completeness" property:

3. If the set of XML documents X1, ..., Xn is produced by the application in any arbitrary sequence of conforming, but differing, XML parsers and generators, then C(Xi) must equal C(Xj) for every pair Xi and Xj. In the last case, generators should be understood to convert DOM representations to concretely encoded surface strings, and processors should be understood to convert concretely encoded surface strings to DOMs.

### This may be a reasonable requirement. I think it is really just a requirement that the canonicalization be well defined and reflect the DOM view of things...
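[A sketch of the "four toggleable transformations give sixteen canonicalizations" idea, as hypothetical code: the four transforms below are simplified stand-ins (the real ones would be DOM HASH, UTF-16 conversion, etc.), and turning each on or off independently yields 2^4 = 16 pipelines. Turning everything off is the "Null" case, hashing the raw byte stream unchanged.]

```python
from itertools import product
import unicodedata

# Four independent, toggleable transformations (placeholder
# implementations, assumptions for illustration only).
def unicode_nfc(data):       # cf. "DOMuni": combine to canonical composed form
    return unicodedata.normalize("NFC", data.decode("utf-8")).encode("utf-8")

def strip_ws(data):          # collapse runs of whitespace (illustrative)
    return b" ".join(data.split())

def lowercase_all(data):     # purely illustrative fourth transform
    return data.lower()

def to_utf16(data):          # cf. "Char": re-encode as UTF-16, network order
    return data.decode("utf-8").encode("utf-16-be")

# The byte re-encoding step runs last so earlier transforms see UTF-8.
TRANSFORMS = [unicode_nfc, strip_ws, lowercase_all, to_utf16]

def canonicalize(data, flags):
    """Apply the selected transforms in sequence; all-off is 'Null',
    i.e. the raw byte stream is hashed unchanged."""
    for on, transform in zip(flags, TRANSFORMS):
        if on:
            data = transform(data)
    return data

# 2**4 = 16 distinct canonicalization pipelines from 4 on/off toggles:
pipelines = list(product([False, True], repeat=4))
assert len(pipelines) == 16
assert canonicalize(b"abc", (False, False, False, False)) == b"abc"  # Null
```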
Since canonicalizers are required to have a many-to-one mapping property, which is forbidden to cryptographic hashing algorithms, I think it is essential to keep the specifications quite separate, with an easily understandable representation of the document at their interface. If the canonicalized form is XML, then it is easier to show that the operation of the canonicalizer has the "equivalence" property, by using a variety of commonly available XML tools to show that the output is equivalent to the input.

### As mentioned briefly in an earlier email of his, Hiroshi Maruyama has come up with a simple modification to DOM HASH which both makes its output XML and gives it the fixed-point property that C(Xi) = C(C(Xi)). The problem with that was namespace prefixes. You just replace each prefix with the hex of the hash of the full namespace designation, everywhere that prefix occurs.

The requirements stated above do not appear to conflict with those in http://www.w3.org/TR/NOTE-xml-canonical-req. However, neither the "exclusion" nor the "completeness" properties are clearly stated there. They may be implied?

Milton M. Anderson
Technical Projects Director
Financial Services Technology Consortium
276 Dartmouth Avenue
Fair Haven, NJ 07704-3121
+1 732 747 1514
miltonma@gte.net
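[A toy sketch of the prefix-replacement idea: each namespace prefix is rewritten to a name derived from the hex of a hash of the full namespace URI, so the arbitrary prefix choice stops mattering and re-canonicalizing is a no-op. The hash choice (SHA-1), the 8-digit truncation, the leading "n" (to keep the name starting with a letter), and the regex-based rewrite are all assumptions for illustration, not Maruyama's actual DOM HASH modification.]

```python
import hashlib
import re

def prefix_for(ns_uri):
    """Derive a prefix from the hex of the hash of the full namespace
    URI. 'n' prefix and 8-digit truncation are illustrative choices."""
    return "n" + hashlib.sha1(ns_uri.encode("utf-8")).hexdigest()[:8]

def rewrite_prefixes(xml_text, ns_map):
    """Toy rewrite: ns_map maps declared prefixes to namespace URIs.
    Every occurrence of a declared prefix is replaced by the
    hash-derived one, so differently chosen prefixes converge."""
    out = xml_text
    for prefix, uri in ns_map.items():
        canon = prefix_for(uri)
        out = re.sub(rf"\b{re.escape(prefix)}:", canon + ":", out)
        out = out.replace(f"xmlns:{prefix}=", f"xmlns:{canon}=")
    return out

uri = "http://example.org/ns"
doc1 = f'<a:doc xmlns:a="{uri}">text</a:doc>'
doc2 = f'<zz:doc xmlns:zz="{uri}">text</zz:doc>'

c1 = rewrite_prefixes(doc1, {"a": uri})
c2 = rewrite_prefixes(doc2, {"zz": uri})
assert c1 == c2                                        # prefix choice gone
assert rewrite_prefixes(c1, {prefix_for(uri): uri}) == c1  # C(C(x)) = C(x)
```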
Received on Monday, 19 April 1999 16:17:28 UTC