- From: Martin J. Duerst <duerst@w3.org>
- Date: Sat, 25 Mar 2000 12:27:24 +0900
- To: w3c-ietf-xmldsig@w3.org
Dear XML Signature WG,

The W3C I18N WG and IG have reviewed your last call draft. Below please
find our comments. We look forward to collaborating with you to resolve
them.

With other last calls, we have found that cross-posting to both groups
is most efficient. Unfortunately, the different setup and anti-spam
measures for the two lists do not allow this at the moment. To make
communication work nevertheless, the writer of this mail has subscribed
to the xmlsig list. Also, it will be crucial that a detailed disposition
of comments, and new versions of the spec or of parts of it, are
available from the XMLSig WG at an early stage, and that the I18N IG is
informed of them quickly, e.g. through the XMLSig W3C team contact, to
make sure there is a chance for additional comments.

URIs
----

For URIs and URI references, add a convention as in
http://www.w3.org/TR/charmod/#URIs. (All W3C specs that use URIs in XML
are doing this.)

Negotiated content
------------------

Signed objects are indicated using URIs. It is unclear which object is
to be signed in case the URI is content-negotiated (e.g.
language-negotiated). Either there must be some mechanism to provide
negotiation parameters (difficult to do in general), or at least a
warning that negotiated resources should not be signed (which may be an
undue restriction). In any case, this has to be discussed in the
specification.

[It is often the case that each variant has its own specific URI (e.g.
http://www.w3.org/2000/02/ATAG-PressRelease.html vs.
http://www.w3.org/2000/02/ATAG-PressRelease.html.fr,
http://www.w3.org/2000/02/ATAG-PressRelease.html.en, and
http://www.w3.org/2000/02/ATAG-PressRelease.html.ja). But this cannot
be guaranteed.]

MimeType and Charset on Transform
---------------------------------

The Transform element has optional MimeType and Charset parameters.
This raises several questions and problems:

- When obtaining a resource (via <Reference>), the resource is supposed
  to come with its own Type/Charset information. If this is not the
  case, then all kinds of operations simply cannot be performed. This
  information is certainly available in HTTP, and the 'charset'
  (encoding) information is available for XML even on protocols that do
  not provide meta-information. This raises the question of priority.

- The charset information provided as explained above may in some cases
  be absent (e.g. in many cases for HTML over HTTP), and in some cases
  may be wrong. This MUST NOT provide a reason for XMLSig to give the
  attributes on <Transform> higher priority than the information
  provided in the protocol or in the resource itself. Otherwise, this
  would undermine efforts to make absolutely sure this information is
  present and available.

- Content negotiation is an additional reason why the 'charset'
  information should be taken from the resource/protocol itself, rather
  than from the attributes on <Transform>.

- In any case, the priority of the various information sources has to
  be clearly specified. It may also be possible to say that an error
  occurs if the information doesn't match.

- Type and charset seem to be on Transform because it is unclear how
  this information would get from one transform to the other. Because
  this information has to get from the original resource to the first
  Transform, a processing model that doesn't allow passing such
  information from one Transform to another even if needed seems
  inappropriate. [The spec currently doesn't say what the processing
  model for the sequence of transforms is; making this clear would help
  us a lot in checking whether it is appropriate for i18n or not. Note
  that describing a processing model in the spec does not mean that all
  implementations have to follow it; it's okay if they do something
  else, provided they produce the same results.] This should be changed
  so that these parameters are eliminated on <Transform>, and the
  necessary information is passed from transform to transform
  automatically (see the sketch after this list). The parameters may
  then be moved to <Reference>.

- Some transforms may need other parameters, while many transforms
  won't need Type or charset. At the moment, for example, there seems
  to be no proposed transform that would need MimeType. Also, the
  result of most transforms is of a very well-defined type and in many
  cases also of a defined 'charset', and having to add this information
  on the next transform is error-prone. If the parameters are kept on
  Transform, it should be stated clearly that each transform definition
  has to define whether they are actually used, and (in particular for
  MimeType) for what.

- We also discussed the question of whether it would be better to have
  both Type and charset in a single attribute, but we didn't reach a
  conclusion.
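To make the requested processing model concrete, here is a minimal
sketch (Python serves purely as pseudocode; all names are ours, and
nothing here is proposed syntax). Each transform receives the octets
together with their type and charset labels and returns a new labeled
result, so nothing has to be restated on the next <Transform>:

    # Sketch only: a transform chain that passes type/charset labels
    # along automatically. All names are illustrative.
    import base64
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Labeled:
        octets: bytes     # the data being transformed
        mime_type: str    # e.g. "text/xml", from the protocol/resource
        charset: str      # e.g. "utf-8", from the protocol/resource

    Transform = Callable[[Labeled], Labeled]

    def base64_transform(inp: Labeled) -> Labeled:
        # The output type and charset are fully determined by the
        # transform itself; the next step need not be told explicitly.
        return Labeled(base64.b64encode(inp.octets), "text/plain",
                       "us-ascii")

    def apply_chain(start: Labeled, chain: List[Transform]) -> Labeled:
        result = start
        for transform in chain:
            result = transform(result)  # labels travel with the data
        return result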
Character encoding and transcoding
----------------------------------

[Transcoding is the conversion from one character encoding (charset) to
another.]

- 'Minimal' canonicalization is required, but it should be made very
  clear that this does not imply that conversion from all 'charset's to
  UTF-8 is required. A set of 'charset's for which support is required
  should be defined exactly, e.g. as UTF-8 and UTF-16. The same applies
  to other transforms.

- There should be a clear and strong warning, and an official
  non-guarantee, in the spec about conversions from legacy 'charset's
  (i.e. encodings not based on the UCS) into encodings based on the UCS
  (i.e. UTF-8, UTF-16, ...). The problem is that while for most
  'charset's all transcoder implementations produce the same result for
  almost all characters, the number of 'charset's for which all
  transcoding implementations behave exactly the same is rather
  limited. As an example, in "Shift_JIS", the 'charset' used on
  Japanese PCs, about 10 to 15 special characters are transcoded
  differently on Windows, on the Mac, and by Java (see the sketch after
  this list).

- Part of the above warning should be to recommend that UTF-8 and
  UTF-16 be used for the original documents, and to recommend the use
  of numeric character references (e.g. &#xABCD;) for cases where
  differences are known.

- We discussed the idea of having a transform that just does
  transcoding. This would allow the various issues to be separated
  clearly.
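One such Shift_JIS divergence can be demonstrated directly; the sketch
below uses Python's bundled codecs as stand-ins for two different
transcoder implementations (the strict JIS mapping versus the Windows
variant, code page 932):

    # Two transcoder implementations disagree on the same legacy bytes.
    # 0x81 0x60 is the Shift_JIS code for the "wave dash" character.
    raw = b"\x81\x60"

    strict = raw.decode("shift_jis")  # U+301C WAVE DASH
    windows = raw.decode("cp932")     # U+FF5E FULLWIDTH TILDE

    assert strict == "\u301c"
    assert windows == "\uff5e"
    assert strict != windows  # same input octets, different UCS output

A signature computed over one transcoder's output will therefore fail
to verify against the other's, even though nothing has been tampered
with.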
Choice of Element names
-----------------------

There is a 'Transforms' element and a 'Transform' element. These are
difficult to distinguish even for English speakers. For people used to
languages without a singular/plural distinction, they will create even
more confusion. We ask you to change 'Transforms' to something like
'TransformList'.

Individual Transforms
---------------------

- Both Base64 and Quoted-Printable are provided as encoding transforms
  (http://www.w3.org/TR/xmldsig-core/#sec-TransformAlg). For
  Quoted-Printable, the only reason given is "Quoted-printable is
  provided, in addition to base-64, in keeping with the XML support of
  a roughly human readable final format.". We have serious doubts that
  this is really useful in the context at hand; if several transform
  steps are used, it should be enough that the data before applying
  base64 is readable, and in all the examples given, base64 is only
  applied to calculated numbers/hex sequences, which are not very
  readable either way. So we request that Quoted-Printable be removed.

- XPath transform

  [Please note: there has already been discussion on the list about
  this with the XSL WG. The comments below are included to make sure
  they can be formally answered.]

  The XMLSig draft contains some provisions to deal with the fact that
  input and output infrastructure (e.g. the equivalent of
  http://www.w3.org/TR/xslt#output) is missing in XPath. In particular,
  the evaluation context includes two variables:

    $exprEncoding: a string containing the character encoding of the
        XPath expression
    $exprBOM: a string containing the byte order mark for the XPath
        expression; set to the empty string if the document containing
        the XPath expression has no byte order mark.

  and makes these available to the XPath expression. The spec also
  says: "The XPath implementation is expected to convert all strings
  appearing in the XPath expression to the same encoding used by the
  input string prior to making any comparisons."

  This should be changed so that:

  - The spec clearly says that all XPath operations have to work as if
    they were implemented in the UCS (to conform to
    http://www.w3.org/TR/xpath#strings). Conversion from the input
    encoding of the XPath expression and from the input encoding of
    the actual object is assumed to occur implicitly (see the sketch
    after this section). This would also resolve the question of what
    lexicographic ordering to use in the output.

  - The encoding of the output be clearly specified as one of the
    following:
    - always, or by default, UTF-8
    - always, or by default, the encoding of the input object
    - specifiable as a parameter to the 'serialize()' function.
    (We would prefer always UTF-8; this reduces implementation effort.)

  - The above variables be scrapped. While the information is relevant
    to the XPath engine, it is in no way relevant to the XPath
    expression.
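The point about operating in the UCS can be illustrated with a small
sketch (Python, for illustration only): once the XPath expression and
the object under test are both decoded to UCS code points, comparisons
no longer depend on the encodings they happened to arrive in, and
variables like $exprEncoding and $exprBOM become unnecessary:

    # The same string arrives in two different encodings; decoding both
    # to the UCS makes the comparison encoding-independent.
    expr_bytes = "été".encode("iso-8859-1")  # expression in Latin-1
    node_bytes = "été".encode("utf-8")       # document data in UTF-8

    expr = expr_bytes.decode("iso-8859-1")   # implicit conversion ...
    node = node_bytes.decode("utf-8")        # ... to UCS code points

    assert expr == node               # compared as UCS: they match
    assert expr_bytes != node_bytes   # the raw octets never would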
Canonicalization and Text Normalization
---------------------------------------

This is a subject that we have discussed at some length, gaining new
insights, but we are still not sure whether we understand its
implications, and the appropriate solutions, in the context of digital
signatures. To make it easier for your side to understand our side,
here is some background:

- Unicode/ISO 10646 encodes a huge number of characters, some of them
  purely for backwards compatibility (e.g. ANGSTROM SIGN, U+212B, which
  is exactly the same as LATIN CAPITAL LETTER A WITH RING ABOVE,
  U+00C5).

- For many accented characters, there are two (or in some cases more)
  ways to represent them (e.g. LATIN SMALL LETTER A WITH GRAVE, U+00E0,
  can also be represented by the sequence of LATIN SMALL LETTER A,
  U+0061, followed by COMBINING GRAVE ACCENT, U+0300).

- Depending on the application, one or the other representation could
  be used, but this leads to problems for the user (because both have
  to look the same) and for the infrastructure (because e.g. XML
  parsers, HTTP servers, DNS, ... would get much too complicated if
  they all had to implement these conversions). This was recognized as
  a problem, and the requirements for a solution are described in
  http://www.w3.org/TR/WD-charreq, sections 2 and 3.

- The solution taken consists of specifying a normalization form as
  close as possible to general current practice, called Normalization
  Form C (NFC). See http://www.unicode.org/unicode/reports/tr15/. The
  second part of the solution is to require normalization to happen
  early (ideally at the keyboard driver), so that in practice
  unnormalized data should be very rare on the Web and the Internet.
  See http://www.w3.org/TR/1999/WD-charmod/#Normalization and
  http://www.ietf.org/internet-drafts/draft-duerst-i18n-norm-03.txt.

- The benefit of the solution is that most applications do not have to
  deal with this level of normalization. If their input is normalized,
  things will work; if not, it is clear which side is to blame.

- Signatures have to be more careful in a lot of ways, so the above
  benefits may not apply to signatures; this has to be carefully
  studied.

- For those not familiar with accented characters, it may be possible
  to construct something similar from case equivalence. Assume that:
  - Users have no way to distinguish case.
  - Internally, there are two cases (upper/lower).
  - The conversion is not that easy to do, and in any case not easy
    enough to be worth bothering every piece of software on the
    Internet with it.
  - This is strengthened by the fact that the frequency of appearance
    of letters (compared to numbers, which don't have this problem) is
    quite low, and that most data around up to today uses lower case
    anyway.
  But please be careful that you understand this analogy fully, and
  don't draw the wrong conclusions.

The above background in a first round led us to the conclusion that
resources should be processed with NFC before being signed, because
this would make sure that accidental, irrelevant (invisible to the
user) changes to a document wouldn't lead to a failure of the
signature. This, among other things, led us to request that NFC be
defined as part of XML Canonicalization.

However, thanks to one of the members of our WG, Masahiro Sekiguchi, we
have discovered the following security problems:

- If NFC is applied before digesting, this gives a higher chance that
  an attacker may find a document with the same digest by chance,
  simply because a larger number of documents is ultimately mapped to
  the same digest. While this potential should be mentioned in the
  security section of your document, the low frequency of accented and
  other relevant characters and the irregularity of the transforms seem
  to make this a rather minor problem.

- Assume that a document contains XML with element names with accented
  characters. Assume that this document is correctly normalized. Assume
  that the signature includes NFC as a transform. Now the following
  attack is possible: an intruder replaces the normalized document with
  a document in which some of the element names are unnormalized. The
  signature still verifies. However, an XML/DOM processor or an XPath
  expression may (and in practice will) work differently, because the
  unnormalized element is treated as different from the normalized one.

  As an example, using the above 'case' analogy, take a document

    <root>
      <amount>$10</amount>
      <amount>$1000</amount>
    </root>

  which is modified by an intruder to look like

    <root>
      <Amount>$10</Amount>
      <amount>$1000</amount>
    </root>

  and combine this with a DOM program that extracts the first <amount>
  and pays somebody that much. After the change by the intruder, the
  amount actually paid is $1000 instead of $10.
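The real (non-analogy) version of this attack can be sketched in a few
lines of Python. Here unicodedata and hashlib stand in for an NFC
transform and a digest algorithm, and the element name and amounts are
of course made up:

    import hashlib
    import unicodedata
    import xml.etree.ElementTree as ET

    COMPOSED = "pay\u00e9"     # element name with precomposed U+00E9
    DECOMPOSED = "paye\u0301"  # same name as 'e' + COMBINING ACUTE

    original = ("<r><%s>$10</%s><%s>$1000</%s></r>"
                % (COMPOSED, COMPOSED, COMPOSED, COMPOSED))
    tampered = ("<r><%s>$10</%s><%s>$1000</%s></r>"
                % (DECOMPOSED, DECOMPOSED, COMPOSED, COMPOSED))

    def digest_after_nfc(doc: str) -> str:
        # A signature that applies NFC as a transform before digesting.
        normalized = unicodedata.normalize("NFC", doc)
        return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

    # The tampered document still verifies against the same digest ...
    assert digest_after_nfc(original) == digest_after_nfc(tampered)

    # ... but a processor that does not normalize sees different names:
    first = ET.fromstring(tampered).find(COMPOSED).text
    assert first == "$1000"  # the signed original would have paid $10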
The above considerations have led us to the conclusions below. We hope
that you can help us, with your experience on security issues, to check
them.

- Make sure that the transforms defined do not *require* NFC. (Any XML
  processor, and any transform based on it, is always *allowed* to do
  normalization; see the definition of 'match' in the XML spec.)

- Provide a transform that *checks* for NFC. The transform fails if the
  input is not in NFC. If the transform succeeds, the output is exactly
  the same as the input. [This is a bit different from the average
  transform, but fits very well into the general model.] (A sketch of
  such a transform follows this list.)

- Advise users (and provide examples) to use this transform after e.g.
  Canonical XML, to avoid missing interactions between numeric
  character references and NFC.

- Advise users to provide their data in NFC (just to reinforce the
  general recommendation), and stress the importance of this in the
  context of digital signatures.

- Provide a transform that actually does NFC, for cases where this is
  desirable, but add the necessary warnings.
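A minimal sketch of the checking transform we have in mind (Python; the
function name and the assumption of UTF-8 input are ours, purely for
illustration):

    import unicodedata

    def nfc_check_transform(octets: bytes,
                            charset: str = "utf-8") -> bytes:
        # Fails if the input is not in NFC; otherwise the output octets
        # are exactly identical to the input octets.
        text = octets.decode(charset)
        if unicodedata.normalize("NFC", text) != text:
            raise ValueError("input is not in Normalization Form C")
        return octets

Used e.g. right after Canonical XML in a transform chain, this rejects
the unnormalized substitution attack described above without changing
the signed octets at all.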
Various comments
----------------

- In 7.1, second numbered list, item 2, characters not representable in
  the chosen encoding should be mentioned.

- Section 4.3.3 The Reference Element, para 3: "pound symbol". This
  term only has a meaning if xml:lang="en-us", i.e. it is not
  understood in places using British English or derivations thereof.

- Section 6.6.3.3 Function Library Additions, para 2: "CDATA sections
  are replaced by their content". This requires the processing to
  behave as if it used the UCS.

- Section 6.6.3.3 Function Library Additions, definition of the
  serialize function: "an open angle bracket ... and a close angle
  bracket ... an open double quote ... a close double quote". In XML
  syntax, there is no such thing as an open (or close) double quote; it
  is always the same double quote character that is used.

- Section 6.6.3.3 Function Library Additions, definition of the
  serialize function: "replacing ... all illegal characters for the
  output character encoding with hexadecimal character references
  (e.g. &#x...;)". What is the rule for leading zeros?

- Section 7.0 XML Canonicalization and Syntax Constraint
  Considerations, para 5: "should yield output in a specific fixed
  character set." / "that character set is UTF-8." The spec uses the
  term "character set" very loosely. Please use the term 'character
  encoding' when speaking about the concept identified by a 'charset'
  parameter value; 'character set' is ambiguous (see
  http://www.w3.org/MarkUp/html-spec/charset-harmful).

- References: a reference to UTF-8 should be added (e.g. RFC 2279). The
  spec should specifically warn about broken "UTF-8".

- General: the treatment of xml:lang, e.g. during transforms, is
  unclear.

==================

Non-i18n comments that were raised by WG/IG members
---------------------------------------------------

- The term 'Minimal' Canonicalization, and the fact that it is
  required, is confusing. It suggests that this transform has to be
  applied for every signature, which is not the case. The really
  minimal transform that has to be applied is the null transform.

- Throughout the spec: "byte stream" (four occurrences), "octet stream"
  (three occurrences), "byte sequence" (one occurrence). Decide on one
  of these?

- Section 2.0 Signature Overview and Examples, para 3: "XML Signatures
  are be applied ...".

- Section 3.2 Core Validation, para 2: "... side affects ...".

- Section 7.0 XML Canonicalization and Syntax Constraint
  Considerations, para 5: "... all XML standards compliant processors
  ...".

- Section 8.2 Only What is "Seen" Should be Signed, para 2: "Those
  application ...".

- Section 8.2 Only What is "Seen" Should be Signed, para 2: "... name
  space ...". All other instances use "namespace".

=================

For the I18N WG/IG,

Regards,   Martin.