- From: Joseph M. Reagle Jr. <reagle@w3.org>
- Date: Thu, 22 Feb 2001 15:04:26 -0500
- To: www-i18n-comments@w3.org
- Cc: Misha Wolf <Misha.Wolf@reuters.com>, "IETF/W3C XML-DSig WG" <w3c-ietf-xmldsig@w3.org>
Reviewers: Don Eastlake and Joseph Reagle

http://www.w3.org/TR/2001/WD-charmod-20010126/

We're very glad to see this specification advanced as it is a very useful reference -- and educational tool for understanding issues of character representation. Consequently, our comments are mostly editorial and relate to confusions experienced as a reader. A few references are made with respect to sections that relate to XML Signature, but these issues have been largely addressed by the last call of the XML Signature WG's documents: Core and Canonical XML.

__
1.1 Goals and Scope

    All W3C specifications have to conform to this document (see section 2 Conformance). Authors of other specifications (for example, IETF specifications) are strongly encouraged to take guidance from it.

As an aside, while we strongly support this goal, this sort of requirement is atypical. Maybe it should sit somewhere else, in a part of the W3C process/guide which is capable of enforcing it?

__
3.1.2 Units of a Writing System, and Units of Aural Rendering

Please define phoneme (as distinct from meaning) and syllabaries.

__
3.1.3 Units of Visual Rendering

    [Unicode] requires that characters are stored and interchanged in logical order.

Please define "logical order" (or cite a definition). Presumably it means the order in which the characters are read by a knowledgeable person, which is independent of the order in which they are printed. The printed order could be left to right or right to left depending on language, or even either at the choice of the writer for some things like Egyptian hieroglyphics.

__
3.1.5 Units of Collation

    Software developers MUST NOT merely use a one-to-one mapping as their string-compare function, as in sorting operations.

What are you suggesting they do? Relying upon human context to determine order seems rather haphazard.
For instance, how do you sort the words in an English document which contains excerpts from a Spanish document containing sequences such as "ch" and "ll", which are considered atomic collation units in their native document but not in the document in which they appear?

__
3.2 Digital Representation of Characters

    3. To enable use in computers, a suitable base datatype is identified (such as a byte, a 16-bit wyde or other) and a character encoding form (CEF) is used, which encodes the abstract integers of a CCS into sequences of the code units of the base datatype.

Note "wyde" typo.

Much of this summary is fairly easy to understand and is demonstrated in Appendix A. However, the distinction between CEF and CES is not very clear and might merit an example -- if it can be done simply; getting into endianness and the BOM might confuse the case...

__
3.5 Reference Processing Model

What does "arbitrarily restrict the range of characters that can be used" mean? Is any non-arbitrary reason good enough for restriction? For example, is the existence of programming languages with a bias toward null-terminated "strings" a good enough reason to prohibit the code point U+0000, unless it can be guaranteed that no such programming language will ever be used to process the data?

__
3.6 Choice and Identification of Character Encodings

I think some reason should be given for when UTF-16 is more appropriate than UTF-8 for APIs. If one is using UTF-8 on the wire, might it not be easier to use it everywhere in some application?

__
3.6.1 Character Encoding Identification

    Because of the layered Web architecture (e.g. formats used over protocols), there may be multiple and at times conflicting information about character encoding. Specifications MUST define conflict-resolution mechanisms (e.g. priorities) for these cases, and implementers and content developers MUST follow them carefully.
This requirement can be relevant to dsig in that there is a Type attribute (of type URI) that could identify the encoding of an identified resource being signed. However, the signature text speaks of dsig types, not MIME types, though MIME types, when represented as a URI, could be included:

http://www.w3.org/TR/2000/CR-xmldsig-core-20001031/#sec-Reference

    4.3.3 The Reference Element

    The Type attribute facilitates the processing of referenced data. For example, while this specification makes no requirements over external data, an application may wish to signal that the referent is a Manifest.

If someone did use this to describe the MIME type, the dsig spec does not address how to resolve conflicting information and leaves it to the application.

__
3.6.2 Private Use Code Points

The recommendation that private-use code points be allowed, combined with the prohibition against any mechanism to facilitate private agreements concerning these code points in any protocol, seems bizarre. Why not leave it up to protocol designers to determine if they will include a mechanism for private extensions or negotiation of privately defined options?

__
3.7 Character Escaping

[Eastlake] While it would be fine to say that the number of ways to escape a character should be minimized, the statement "There SHOULD be only one way to escape a character." seems wrong. Every decent language or syntax of any power has 2 or 3, from allowing both single and double quotes, so each can quote the other, to both a single character escape ("absquote") which escapes the following character and a string quote mechanism, to CDATA sections plus the other mechanisms in XML, etc. A recommendation violated by every computer language of reasonable complexity and power I can think of offhand would require extraordinarily strong justification, which I don't see.

[Reagle] I support the "SHOULD" token and find that multiple ways to escape a character add to complexity.
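
To make the XML case mentioned above concrete, here is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser and a hypothetical element name "t". It shows four spellings of the same character (U+003C, "<") that a conforming parser reports as identical text. It is only an illustration of the mechanisms named in the comment, not a statement about what the draft should require.

```python
import xml.etree.ElementTree as ET

# Four ways XML 1.0 lets an author write a literal "<" in element content.
# The element name "t" is hypothetical and chosen only for this sketch.
forms = {
    "predefined entity": "<t>&lt;</t>",
    "decimal character reference": "<t>&#60;</t>",
    "hexadecimal character reference": "<t>&#x3C;</t>",
    "CDATA section": "<t><![CDATA[<]]></t>",
}


def escaped_text(xml_source: str) -> str:
    """Parse a one-element document and return its text content."""
    return ET.fromstring(xml_source).text


# Map each escape mechanism to the text the parser hands to the application.
results = {name: escaped_text(source) for name, source in forms.items()}

for name, text in results.items():
    print(f"{name}: {text!r}")
```

Every entry parses to the same one-character string, so an application working on parsed text cannot tell which mechanism the author used. That is the extra complexity noted on one side of the discussion, and the existing practice noted on the other.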
__
4 Early Uniform Normalization
4.1 Motivation

    This document also specifies that normalization is to be performed early (by the sender) as opposed to late (by the recipient).

Note, the dsig specification RECOMMENDS but does not require the signature be in NFC:

http://www.w3.org/TR/2000/CR-xmldsig-core-20001031/#sec-XML-Canonicalization

    We RECOMMEND that signature applications create XML content (Signature elements and their descendents/content) in Normalization Form C [NFC] and check that any XML being consumed is in that form as well (if not, signatures may consequently fail to validate).

__
4.3 Responsibility for Normalization

    Note: The prohibition of normalization by recipients is necessary for consistency, on which security depends.

DSIG is compliant with this:

http://www.w3.org/TR/2000/CR-xmldsig-core-20001031/#sec-See

    8.1.3 "See" What is Signed

    Consequently, while we RECOMMEND all documents operated upon and generated by signature applications be in [NFC] (otherwise intermediate processors might unintentionally break the signature) encoding normalizations SHOULD NOT be done as part of a signature transform, or (to state it another way) if normalization does occur, the application SHOULD always "see" (operate over) the normalized form.

__
8 Character Encoding in URI References

    This chapter defines how to address this issue in W3C specifications in a way consistent with the model defined in this document and with deployed practice.

DSIG is compliant with this, see:

http://www.w3.org/TR/2000/CR-xmldsig-core-20001031/#sec-URI

__
__
Joseph Reagle Jr.                 http://www.w3.org/People/Reagle/
W3C Policy Analyst                mailto:reagle@w3.org
IETF/W3C XML-Signature Co-Chair   http://www.w3.org/Signature
W3C XML Encryption Chair          http://www.w3.org/Encryption/2001/
Received on Thursday, 22 February 2001 16:06:58 UTC