Re: Draft Comments on Charmod Last Call from Donald E. Eastlake 3rd on 2001-02-21 (w3c-ietf-xmldsig@w3.org from January to March 2001)

From: Donald E. Eastlake 3rd <dee3@torque.pothole.com>
Date: Wed, 21 Feb 2001 00:59:30 -0500
To: "Joseph M. Reagle Jr." <reagle@w3.org>
cc: "IETF/W3C XML-DSig WG" <w3c-ietf-xmldsig@w3.org>, "Martin J. Duerst" <duerst@w3.org>, "John Boyer" <jboyer@PureEdge.com>
Message-Id: <200102210559.AAA0000017376@torque.pothole.com>

I generally agree with your comments but have added a few more...

From:  "Joseph M. Reagle Jr." <reagle@w3.org>
Message-Id:  <4.3.2.7.2.20010214184745.00b18f08@rpcp.mit.edu>
Date:  Wed, 14 Feb 2001 19:45:33 -0500
To:  "IETF/W3C XML-DSig WG" <w3c-ietf-xmldsig@w3.org>
Cc:  "Martin J. Duerst" <duerst@w3.org>, "John Boyer"  <jboyer@PureEdge.com>

>Here are my comments that can be combined with others, if made, before 
>forwarding them on to the I18N groups.
>
>__
>
>http://www.w3.org/TR/2001/WD-charmod-20010126/
>
>I'm very glad to see this specification advanced as it is a very useful 
>reference -- and educational tool for myself at least. One would think 
>representing characters is easy, though it's tricky! Consequently, my 
>comments are mostly editorial and relate to any confusions I experienced as 
>a reader and could easily be remedied. A few references are made with 
>respect to sections that relate to XML Signature, but these issues have been 
>largely addressed by the last call of the XML Signature WG's documents: Core 
>and Canonical XML.
>
>>1.1 Goals and Scope
>>    All W3C specifications have to conform to this document (see section
>>    [57]2 Conformance). Authors of other specifications (for example, IETF
>>    specifications) are strongly encouraged to take guidance from it.
>
>As an aside, while I strongly support this goal, this sort of requirement is 
>atypical and maybe should sit somewhere else in part of the W3C 
>process/guide which is capable of enforcing it?
>
>
>>3.1.2 Units of a Writing System, and Units of Aural Rendering
>
>Please define phoneme, (as distinct from meaning), and syllabaries.
>
>
>>3.1.3 Units of Visual Rendering
>>[Unicode] requires that characters are stored and interchanged in logical 
>>order.
>
>Please define "logical order" (or cite definition).

I agree this needs some clarification.  Presumably it means the order
in which the characters are read by a knowledgeable person, which is
independent of the order in which they are printed.  That could be
left to right or right to left depending on the language, or even
either at the choice of the writer for some scripts such as Egyptian
hieroglyphics.
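
As a rough illustration of that reading of "logical order" (a sketch
in Python, not anything taken from the Charmod draft; the sample
string is invented): the characters are stored in the order they are
read, and the display direction is a per-character property that
rendering software looks up.

```python
import unicodedata

# Hypothetical sample: Latin "abc", a space, then the Hebrew letters
# alef, bet, gimel, all stored in reading (logical) order.
text = "abc \u05d0\u05d1\u05d2"

def bidi_classes(s):
    """Return the Unicode bidirectional class of each character, in storage order."""
    return [unicodedata.bidirectional(ch) for ch in s]

# "L" is left-to-right, "R" is right-to-left, "WS" is whitespace.
print(bidi_classes(text))
```

The stored sequence never changes; only the renderer decides that the
last three characters are laid out right to left.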

>>3.1.5 Units of Collation
>>Software developers MUST NOT merely use a one-to-one mapping as their 
>>string-compare function, as in sorting operations.
>
>What are you suggesting they do? Relying upon human context to determine 
>order seems rather haphazard. For instance, how do you sort the words in an 
>English document which contains excerpts from a Spanish document containing 
>sequences such as "ch" and "ll" which are considered atomic collation units 
>in their native document, but not in the document in which they appear?
>
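
To make the difference concrete, here is a minimal sketch (Python)
comparing a plain code point sort with a sort that treats "ch" and
"ll" as single collation units, as in traditional Spanish ordering.
The alphabet table, the spanish_key helper and the word list are
simplified assumptions for illustration (lowercase only, no accents);
this is not a full collation algorithm.

```python
# Simplified traditional Spanish alphabet: "ch" and "ll" are single units.
SPANISH_UNITS = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "\u00f1", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x",
    "y", "z",
]
RANK = {unit: i for i, unit in enumerate(SPANISH_UNITS)}

def spanish_key(word):
    """Split a lowercase word into collation units and return their ranks."""
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in ("ch", "ll"):
            key.append(RANK[pair])      # two characters, one collation unit
            i += 2
        else:
            key.append(RANK[word[i]])   # raises KeyError on unknown characters
            i += 1
    return key

words = ["cuna", "chico", "llama", "luz", "dato"]
by_code_point = sorted(words)               # one-to-one character comparison
by_spanish = sorted(words, key=spanish_key)
print(by_code_point)
print(by_spanish)
```

Which of the two orders is right for a mixed-language document is
exactly the open question; the code only shows that they differ.
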
>
>>3.2 Digital Representation of Characters
>>3. To enable use in computers, a suitable base datatype is identified (such 
>>as a byte, a 16-bit wyde or other) and a character encoding form (CEF) is 
>>used, which encodes the abstract integers of a CCS into sequences of the 
>>code units of the base datatype.
>
>Note "wyde" typo. Much of this summary is fairly easy to understand and is 
>demonstrated in Appendix A. However, the distinction between CEF and CES is 
>not very clear and might merit an example -- if it can be done simply; 
>getting into endianness and BOMs might confuse the case...
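
If an example is wanted, something along these lines might do (a
Python sketch; the sample code point is arbitrary and the variable
names are invented).  It walks one character through the three
layers: the abstract integer of the CCS, the 16-bit code units
produced by the CEF, and the byte sequences produced by two different
serializations (CES) of those same code units.

```python
ch = "\U00010400"  # arbitrary sample: a code point outside the BMP

# CCS layer: the abstract integer assigned to the character.
code_point = ord(ch)

# CEF layer (UTF-16): the integer becomes a sequence of 16-bit code units.
be_bytes = ch.encode("utf-16-be")
code_units = [int.from_bytes(be_bytes[i:i + 2], "big")
              for i in range(0, len(be_bytes), 2)]

# CES layer: the same code units serialized as bytes in two byte orders.
le_bytes = ch.encode("utf-16-le")

print(hex(code_point))
print([hex(u) for u in code_units])
print(be_bytes.hex(), le_bytes.hex())
```

The code units are identical in both cases; only the byte
serialization differs, which is where the CES comes in.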

3.5 Reference Processing Model
What does "arbitrarily restrict the range of characters that can be
used" mean?  Is any non-arbitrary reason good enough for restriction?
For example, is the existence of programming languages with a bias
toward null-terminated "strings" a good enough reason to prohibit the
code point U+0000 unless it can be guaranteed that no such programming
language will ever be used to process the data?
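
A small sketch of the hazard in question (Python, using the standard
ctypes module to imitate a C-style string API; the as_c_string helper
and the payload are invented for illustration):

```python
import ctypes

def as_c_string(data: bytes) -> bytes:
    """Return what a NUL-terminated string API would see in `data`."""
    buf = ctypes.create_string_buffer(data)  # copies data and appends a NUL
    return buf.value                         # reads up to the first NUL only

payload = "ab\u0000cd".encode("utf-8")  # U+0000 encodes as one 0x00 byte
print(len(payload))          # length of the original data in bytes
print(as_c_string(payload))  # everything after U+0000 is silently dropped
```

Whether that kind of truncation counts as a non-arbitrary reason to
exclude U+0000 is the question for the draft.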

3.6 Choice and Identification of Character Encodings
I think some reason should be given for when UTF-16 is more
appropriate than UTF-8 for APIs.  If one is using UTF-8 on the wire,
might it not be easier to use it everywhere in a given application?
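
One measurable difference between the two is encoded size, which
depends on the script.  The sketch below (Python; the sample strings
and the encoded_sizes helper are invented) shows only that trade-off
and does not by itself settle which encoding suits an API.

```python
def encoded_sizes(s):
    """Return (UTF-8 byte count, UTF-16 byte count) for a string, without BOM."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-le"))

samples = {
    "ascii": "Signature",     # 9 ASCII letters
    "cjk": "\u7f72\u540d",    # 2 CJK ideographs
}

for name, s in samples.items():
    print(name, encoded_sizes(s))
```

ASCII text is half the size in UTF-8, while the CJK sample is smaller
in UTF-16.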

>>3.6.1 Character Encoding Identification
>>Because of the layered Web architecture (e.g. formats used over protocols), 
>>there may be multiple and at times conflicting information about character 
>>encoding. Specifications MUST define conflict-resolution mechanisms (e.g. 
>>priorities) for these cases, and implementers and content developers MUST 
>>follow them carefully.
>
>This requirement can be relevant to dsig in that there is a Type attribute (of 
>type URI) that could identify the encoding of an identified resource being 
>signed. However, the signature text speaks of dsig types, not MIME types, 
>though MIME types when represented as a URI could be included:
>
>>http://www.w3.org/TR/2000/CR-xmldsig-core-20001031/#sec-Reference
>>4.3.3 The Reference Element
>>. The Type attribute facilitates the processing of referenced data. For 
>>example, while this specification makes no requirements over external data, 
>>an application may wish to signal that the referent is a Manifest.
>
>If someone did use this to describe the MIME type, the dsig spec does not 
>address how to resolve conflicting information and leaves it to the 
>application.

3.6.2 Private Use Code Points
The recommendation that private-use code points be allowed, combined
with the prohibition against any mechanism in any protocol to
facilitate private agreements concerning these code points, seems
bizarre.  Why not leave it up to protocol designers to determine if
they will include a mechanism for private extensions or negotiation of
privately defined options?

3.7 Character Escaping
While it would be fine to say that the number of ways to escape a
character should be minimized, the statement "There SHOULD be only one
way to escape a character." seems wrong.  Every decent language or
syntax of any power has 2 or 3, from allowing both single and double
quotes, so each can quote the other, to both a single character escape
("absquote") which escapes the following character and a string quote
mechanism, to CDATA sections plus the other mechanisms in XML, etc.  A
recommendation violated by every computer language of reasonable
complexity and power I can think of offhand would require
extraordinarily strong justification, which I don't see.
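
XML itself is a handy demonstration.  In the sketch below (Python,
standard xml.etree.ElementTree parser; the element name and content
are invented) the same character "<" is written four different ways,
and a parser reports identical text for all of them.

```python
import xml.etree.ElementTree as ET

# Four ways XML 1.0 allows the character "<" to appear in element content.
variants = [
    "<a>1 &lt; 2</a>",           # predefined entity reference
    "<a>1 &#60; 2</a>",          # decimal character reference
    "<a>1 &#x3C; 2</a>",         # hexadecimal character reference
    "<a><![CDATA[1 < 2]]></a>",  # CDATA section
]

texts = [ET.fromstring(v).text for v in variants]
print(texts)
```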

>>4 Early Uniform Normalization
>>4.1 Motivation
>>This document also specifies that normalization is to be performed early 
>>(by the sender) as opposed to late (by the recipient).
>
>Note, the dsig specification RECOMMENDS but does not require the signature 
>be in NFC:
>
>>http://www.w3.org/TR/2000/CR-xmldsig-core-20001031/#sec-XML-Canonicalization
>>We RECOMMEND that signature applications create XML content (Signature 
>>elements and their descendents/content) in Normalization Form C [NFC] and 
>>check that any XML being consumed is in that form as well (if not, 
>>signatures may consequently fail to validate).
>
>
>
>>4.3 Responsibility for Normalization
>>Note: The prohibition of normalization by recipients is necessary for 
>>consistency, on which security depends.
>
>DSIG is compliant with this:
>
>>http://www.w3.org/TR/2000/CR-xmldsig-core-20001031/#sec-See
>>8.1.3 "See" What is Signed
>>Consequently, while we RECOMMEND all documents operated upon and generated 
>>by signature applications be in [NFC] (otherwise intermediate processors 
>>might unintentionally break the signature) encoding normalizations SHOULD 
>>NOT be done as part of a signature transform, or (to state it another way) 
>>if normalization does occur, the application SHOULD always "see" (operate 
>>over) the normalized form.
>
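
The reason this matters for signatures can be sketched as follows
(Python, standard hashlib and unicodedata modules; the sample strings
and the digest helper are invented, and SHA-1 over UTF-8 stands in
for the real digest computation).  Two strings that display
identically give different digests unless both are in the same
normalization form.

```python
import hashlib
import unicodedata

composed = "caf\u00e9"     # e-acute as a single code point (already NFC)
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent

def digest(s):
    """SHA-1 over the UTF-8 bytes of a string, as a hex string."""
    return hashlib.sha1(s.encode("utf-8")).hexdigest()

print(digest(composed) == digest(decomposed))
print(digest(composed) == digest(unicodedata.normalize("NFC", decomposed)))
```

The first comparison is False and the second is True, so an
intermediary that silently changes the normalization form would break
a signature computed over the original.
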
>
>>8 Character Encoding in URI References
>>This chapter defines how to address this issue in W3C specifications in a 
>>way consistent with the model defined in this document and with deployed 
>>practice.
>
>DSIG is compliant with this, see:
>>http://www.w3.org/TR/2000/CR-xmldsig-core-20001031/#sec-URI
>
>__
>Joseph Reagle Jr.                 http://www.w3.org/People/Reagle/
>W3C Policy Analyst                mailto:reagle@w3.org
>IETF/W3C XML-Signature Co-Chair   http://www.w3.org/Signature
>W3C XML Encryption Chair          http://www.w3.org/Encryption/2001/

Thanks,
Donald
===================================================================
 Donald E. Eastlake 3rd                    dee3@torque.pothole.com
 155 Beaver Street                          lde008@dma.isg.mot.com
 Milford, MA 01757 USA     +1 508-634-2066(h)   +1 508-261-5434(w)
Received on Wednesday, 21 February 2001 00:59:40 UTC