xmldsig Charmod Comments from Joseph M. Reagle Jr. on 2001-02-22 (www-i18n-comments@w3.org from February 2001)

From: Joseph M. Reagle Jr. <reagle@w3.org>
Date: Thu, 22 Feb 2001 15:04:26 -0500
To: www-i18n-comments@w3.org
Cc: Misha Wolf <Misha.Wolf@reuters.com>, "IETF/W3C XML-DSig WG" <w3c-ietf-xmldsig@w3.org>
Message-Id: <4.3.2.7.2.20010222150238.02c0d198@rpcp.mit.edu>
Reviewers: Don Eastlake and Joseph Reagle

http://www.w3.org/TR/2001/WD-charmod-20010126/

We're very glad to see this specification advanced, as it is a very
useful reference and educational tool for understanding issues of
character representation. Consequently, our comments are mostly editorial
and relate to confusions experienced as a reader. A few references
are made with respect to sections that relate to XML Signature, but
these issues have been largely addressed by the last call of the XML
Signature WG's documents: Core and Canonical XML.

__

1.1 Goals and Scope
     All W3C specifications have to conform to this document (see section
     [57]2 Conformance). Authors of other specifications (for example, IETF
     specifications) are strongly encouraged to take guidance from it.

As an aside, while we strongly support this goal, this sort of requirement
is atypical and might better sit in a part of the W3C process/guide that
is capable of enforcing it.

__

3.1.2 Units of a Writing System, and Units of Aural Rendering

Please define "phoneme" (as distinct from meaning) and "syllabaries".

__

3.1.3 Units of Visual Rendering
[Unicode] requires that characters are stored and interchanged in logical
order.

Please define "logical order" (or cite a definition). Presumably it means
the order in which the characters are read by a knowledgeable person,
which is independent of the order in which they are printed. The printed
order could be left to right or right to left depending on the language,
or even either at the choice of the writer for some things like Egyptian
hieroglyphics.
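
A minimal sketch of that presumed meaning, assuming the Hebrew word
"shalom" as the sample text (the word and the variable names are
illustrative, not taken from the draft). The first code point stored is
the first letter a reader pronounces, although a renderer would draw it
at the right-hand end of the word:

```python
import unicodedata

# "shalom" in storage order: shin, lamed, vav, final mem.
word = "\u05E9\u05DC\u05D5\u05DD"

# Name of the first stored code point (the first letter read aloud).
first_name = unicodedata.name(word[0])

# Bidirectional class of each code point; "R" marks right-to-left letters.
directions = [unicodedata.bidirectional(ch) for ch in word]

print(first_name, directions)
```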

__


3.1.5 Units of Collation
Software developers MUST NOT merely use a one-to-one mapping as their
string-compare function, as in sorting operations.

What are you suggesting they do? Relying upon human context to determine
order seems rather haphazard. For instance, how do you sort the words in an
English document which contains excerpts from a Spanish document containing
sequences such as "ch" and "ll", which are considered atomic collation units
in their native document but not in the document in which they appear?
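
To make the question concrete, here is a rough sketch of how the same
word list sorts under a plain code point comparison and under a
hypothetical table of traditional Spanish collation units. The ALPHABET
table, the key function and the word list are assumptions for
illustration only, not a complete or authoritative collation:

```python
import re

# Hypothetical ordering: "ch" and "ll" are single units that sort
# after "c" and "l" respectively.
ALPHABET = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
            "l", "ll", "m", "n", "\u00f1", "o", "p", "q", "r", "s", "t",
            "u", "v", "w", "x", "y", "z"]
RANK = {unit: i for i, unit in enumerate(ALPHABET)}

# Match the two-letter units first, then fall back to single characters.
UNIT = re.compile("ch|ll|.", re.DOTALL)

def traditional_spanish_key(word):
    """Split a word into collation units and map each to its rank."""
    units = UNIT.findall(word.lower())
    return [RANK.get(u, len(ALPHABET)) for u in units]

words = ["cura", "chico", "llama", "luz", "dama"]
by_code_point = sorted(words)
by_units = sorted(words, key=traditional_spanish_key)

print(by_code_point)
print(by_units)
```

The two results differ, and nothing in the strings themselves says which
rule a mixed-language document should apply.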

__

3.2 Digital Representation of Characters
3. To enable use in computers, a suitable base datatype is identified (such
as a byte, a 16-bit wyde or other) and a character encoding form (CEF) is
used, which encodes the abstract integers of a CCS into sequences of the
code units of the base datatype.

Note "wyde" typo. Much of this summary is fairly easy to understand and is
demonstrated in Appendix A. However, the distinction between CEF and CES is
not very clear and might merit an example, if it can be done simply,
since getting into endianness and the BOM might confuse the case...
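
One possible shape for such an example, as a sketch only: the helper
names below are hypothetical, UTF-16 is assumed as the encoding, and the
BOM is left out entirely. The CEF step turns a code point into 16-bit
code units; the CES step turns those code units into bytes, which is the
only place where byte order enters:

```python
def utf16_code_units(code_point):
    """CEF step: one code point -> list of 16-bit code units.

    Surrogate code points (U+D800..U+DFFF) are not checked here.
    """
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]

def serialize(code_units, byteorder):
    """CES step: 16-bit code units -> bytes in the given byte order."""
    return b"".join(unit.to_bytes(2, byteorder) for unit in code_units)

ch = "\U00010384"                    # a code point outside the BMP
units = utf16_code_units(ord(ch))    # same for every byte order
be = serialize(units, "big")         # UTF-16BE bytes
le = serialize(units, "little")      # UTF-16LE bytes

print([hex(u) for u in units], be.hex(), le.hex())
```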

__

3.5 Reference Processing Model

What does "arbitrarily restrict the range of characters that can be
used" mean?  Is any non-arbitrary reason good enough for restriction?
For example, is the existence of programming languages with a bias
toward null-terminated "strings" a good enough reason to prohibit the
code point U+0000 unless it can be guaranteed that no such programming
language will ever be used to process the data?
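
A small sketch of the hazard behind this question, using Python's ctypes
buffers as a stand-in for code written in such a language (the sample
data is an assumption). Everything after U+0000 disappears when the
buffer is read as a null-terminated string:

```python
import ctypes

# U+0000 encodes in UTF-8 as a single 0x00 byte.
data = "ab\u0000cd".encode("utf-8")

# A C-style character buffer holding the data plus a trailing NUL.
buf = ctypes.create_string_buffer(data)

full = buf.raw[:len(data)]   # every byte that was stored
c_view = buf.value           # what a null-terminated reader sees

print(full, c_view)
```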

__

3.6 Choice and Identification of Character Encodings

I think some reason should be given when UTF-16 is more appropriate
than UTF-8 for APIs.  If one is using UTF-8 on the wire, might it not
be easier to use it everywhere in some application?
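
One consideration sometimes raised here is storage size, which depends on
the text. The sketch below is only an illustration with assumed sample
strings, and size is one factor among several; it compares the encoded
length of an ASCII phrase and a short Japanese phrase in both encodings:

```python
samples = {
    "ascii": "character model",
    # Five BMP characters: two kanji followed by three katakana.
    "cjk": "\u6587\u5b57\u30e2\u30c7\u30eb",
}

# Map each sample name to (UTF-8 byte count, UTF-16 byte count, no BOM).
sizes = {
    name: (len(text.encode("utf-8")), len(text.encode("utf-16-le")))
    for name, text in samples.items()
}

print(sizes)
```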

__

3.6.1 Character Encoding Identification
Because of the layered Web architecture (e.g. formats used over protocols),
there may be multiple and at times conflicting information about character
encoding. Specifications MUST define conflict-resolution mechanisms (e.g.
priorities) for these cases, and implementers and content developers MUST
follow them carefully.

This requirement can be relevant to dsig in that there is a Type attribute
(of type URI) that could identify the encoding of a resource being signed.
However, the signature text speaks of dsig types, not MIME types, though
MIME types represented as URIs could be included:

http://www.w3.org/TR/2000/CR-xmldsig-core-20001031/#sec-Reference
4.3.3 The Reference Element
. The Type attribute facilitates the processing of referenced data. For
example, while this specification makes no requirements over external data,
an application may wish to signal that the referent is a Manifest.

If someone did use this to describe the MIME type, the dsig spec does not
address how to resolve conflicting information and leaves it to the
application.

__

3.6.2 Private Use Code Points

The recommendation that private-use code points be allowed, combined
with the prohibition against any mechanism to facilitate private
agreements concerning these code points in any protocol, seems bizarre.
Why not leave it up to protocol designers to determine whether they will
include a mechanism for private extensions or negotiation of privately
defined options?

__

3.7 Character Escaping

[Eastlake] While it would be fine to say that the number of ways to escape a
character should be minimized, the statement "There SHOULD be only one
way to escape a character." seems wrong.  Every decent language or
syntax of any power has 2 or 3, from allowing both single and double
quotes, so each can quote the other, to both a single character escape
("absquote") which escapes the following character and a string quote
mechanism, to CDATA sections plus the other mechanisms in XML, etc.  A
recommendation violated by every computer language of reasonable
complexity and power I can think of offhand would require
extraordinarily strong justification, which I don't see.

[Reagle] I support the "SHOULD" token and find that multiple ways to
escape a character add to complexity.
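
For reference, a minimal sketch of the XML case mentioned above, using
Python's standard xml.etree.ElementTree parser (the element name "a" is
an arbitrary choice). Four different spellings of "<" all yield the same
character data:

```python
import xml.etree.ElementTree as ET

forms = [
    "<a>&lt;</a>",           # predefined entity
    "<a>&#60;</a>",          # decimal character reference
    "<a>&#x3C;</a>",         # hexadecimal character reference
    "<a><![CDATA[<]]></a>",  # CDATA section
]

# Parse each form and collect the character data of the element.
texts = [ET.fromstring(form).text for form in forms]

print(texts)
```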

__

4 Early Uniform Normalization
4.1 Motivation
This document also specifies that normalization is to be performed early
(by the sender) as opposed to late (by the recipient).

Note that the dsig specification RECOMMENDS, but does not require, that
the signature be in NFC:

http://www.w3.org/TR/2000/CR-xmldsig-core-20001031/#sec-XML-Canonicalization
We RECOMMEND that signature applications create XML content (Signature
elements and their descendents/content) in Normalization Form C [NFC] and
check that any XML being consumed is in that form as well (if not,
signatures may consequently fail to validate).
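
A minimal sketch of why un-normalized content can make validation fail.
The sample text is an assumption, and SHA-1 is only a stand-in for
whatever digest a signature actually uses. Two strings that render
identically produce different digests until both are brought to NFC:

```python
import hashlib
import unicodedata

composed = "\u00e9"      # e-acute as a single code point (already NFC)
decomposed = "e\u0301"   # "e" followed by a combining acute accent

def digest(text):
    """Hex digest of the UTF-8 bytes of the text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

# The digests differ because the underlying byte sequences differ.
digests_differ = digest(composed) != digest(decomposed)

# After normalizing to NFC the two strings are identical.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

print(digests_differ, same_after_nfc)
```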

__

4.3 Responsibility for Normalization
Note: The prohibition of normalization by recipients is necessary for
consistency, on which security depends.

DSIG is compliant with this:

http://www.w3.org/TR/2000/CR-xmldsig-core-20001031/#sec-See
8.1.3 "See" What is Signed
Consequently, while we RECOMMEND all documents operated upon and generated
by signature applications be in [NFC] (otherwise intermediate processors
might unintentionally break the signature) encoding normalizations SHOULD
NOT be done as part of a signature transform, or (to state it another way)
if normalization does occur, the application SHOULD always "see" (operate
over) the normalized form.

__

8 Character Encoding in URI References
This chapter defines how to address this issue in W3C specifications in a
way consistent with the model defined in this document and with deployed
practice.

DSIG is compliant with this, see:
http://www.w3.org/TR/2000/CR-xmldsig-core-20001031/#sec-URI
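
A minimal sketch of the escaping convention at issue, under the
assumption that non-ASCII characters in a URI reference are converted to
UTF-8 and each resulting byte is written as %HH. The file name is
hypothetical; urllib.parse.quote is Python's standard helper and uses
UTF-8 by default:

```python
from urllib.parse import quote, unquote

reference = "r\u00e9sum\u00e9.xml"   # hypothetical non-ASCII file name

# Each non-ASCII character becomes its UTF-8 bytes, percent-escaped.
escaped = quote(reference)

# Unescaping with UTF-8 recovers the original characters.
round_trip = unquote(escaped)

print(escaped)
```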

__

__
Joseph Reagle Jr.                 http://www.w3.org/People/Reagle/
W3C Policy Analyst                mailto:reagle@w3.org
IETF/W3C XML-Signature Co-Chair   http://www.w3.org/Signature
W3C XML Encryption Chair          http://www.w3.org/Encryption/2001/
Received on Thursday, 22 February 2001 16:06:58 UTC