New draft-yergeau-rfc2279bis-05.txt from Francois Yergeau on 2003-06-09 (ietf-charsets@w3.org from April to June 2003)

From: Francois Yergeau <FYergeau@alis.com>
Date: Mon, 09 Jun 2003 16:09:37 -0400
To: ietf-charsets@iana.org
Message-id: <F7D4BDA0E5A1D14B99D32C022AEB7366EEE3BE@alis-2k.alis.domain>
...just submitted to secretariat.

This revision addresses two substantive issues raised by the IESG during
post-last-call evaluation, as well as a few minor points that have shown up
since -04.

Changes from IESG review:
============================================================================

One director requested that it be made clear that the ABNF in section 4 is
not normative, both because it is new and untested -- added between Draft
and Standard -- and because RFC 2234 is only Proposed.  Section 4 now begins
with a new para:

   For the convenience of implementors using ABNF, a definition of UTF-8
   in ABNF syntax is given here.

and ends with a new Note:

   NOTE -- The authoritative definition of UTF-8 is in [UNICODE]. This
   grammar is believed to describe the same thing as what Unicode
   describes, but does not claim to be authoritative. Implementors are
   urged to rely on the authoritative source, rather than on this ABNF.

============================================================================

One director requested additional material in Security Considerations about
the fact that octet-by-octet comparison is not sufficient (the Unicode
normalization issue).  The following has been added at the end of section
10:

   Security may also be impacted by a characteristic of several
   character encodings, including UTF-8: the "same thing" (as far as a
   user can tell) can be represented by several distinct character
   sequences. For instance, an e with acute accent can be represented by
   the precomposed U+00E9 E ACUTE character or by the canonically
   equivalent sequence U+0065 U+0301 (E + COMBINING ACUTE). Even though
   UTF-8 provides a single byte sequence for each character sequence,
   the existence of multiple character sequences for "the same thing"
   may have security consequences whenever string matching, indexing,
   searching, sorting, regular expression matching and selection are
   involved.  An example would be string matching of an identifier
   appearing in a credential and in access control list entries.  This
   issue is amenable to solutions based on Unicode Normalization Forms,
   see [UAX15].

together with a new entry in Informative references for "Unicode Standard
Annex #15: Unicode Normalization Forms".
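
As an illustration of the normalization issue described in the new text, here is a minimal Python sketch. It is not part of the draft; the helper name same_after_nfc is hypothetical, and NFC is chosen only as one of the forms defined in [UAX15]. It shows that the two spellings of "e with acute accent" differ octet by octet in UTF-8 yet compare equal once both are normalized.

```python
import unicodedata

# The two canonically equivalent spellings mentioned in the draft text.
precomposed = "\u00e9"   # U+00E9
decomposed = "e\u0301"   # U+0065 U+0301


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC.

    Hypothetical helper for illustration; a real application would pick
    the normalization form its protocol or profile calls for.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Octet-by-octet comparison of the UTF-8 forms sees two different strings.
raw_equal = precomposed.encode("utf-8") == decomposed.encode("utf-8")

# Comparison after normalization treats them as the same identifier.
normalized_equal = same_after_nfc(precomposed, decomposed)
```

In an access control check, comparing the raw octets as in raw_equal would treat the two spellings as different identifiers, which is the kind of mismatch the new paragraph warns about.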


Minor changes:
============================================================================

In Introduction, add "code position" to "(the character number, a.k.a. code
point or Unicode scalar value)".

Rationale: "code position" is the 10646 term.

============================================================================

In Introduction, change

   o  The octet values C0, C1, FE and FF never appear. If the range of
      character numbers is restricted to U+0000..U+10FFFF (the UTF-16
      accessible range), then the octet values F5..FD also never appear.

to

   o  The octet values C0, C1, and F5 to FF never appear.

Rationale: we do restrict to U+0000..U+10FFFF now, so the "If" is superfluous.
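
The revised bullet can be checked mechanically. The following Python sketch is not from the draft, and the names FORBIDDEN, octets_used and has_forbidden_octet are made up for illustration. It encodes every character number in U+0000..U+10FFFF (skipping the surrogate range, which cannot be encoded) and collects the octet values that occur.

```python
# Octet values the revised bullet says never appear: C0, C1 and F5..FF.
FORBIDDEN = frozenset({0xC0, 0xC1} | set(range(0xF5, 0x100)))


def octets_used():
    """Return the set of octet values seen when encoding every character
    number in U+0000..U+10FFFF as UTF-8, surrogates excluded."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates have no UTF-8 encoding
        used.update(chr(cp).encode("utf-8"))
    return used


def has_forbidden_octet(data: bytes) -> bool:
    """Cheap pre-check only: True if data contains an octet that can never
    occur in UTF-8. A False result does not prove the data is valid."""
    return any(b in FORBIDDEN for b in data)


used = octets_used()
```

The scan takes a moment since it visits over a million character numbers. Note that has_forbidden_octet is only a quick rejection test, not a substitute for full validation.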

============================================================================

In Introduction, add "byte-value" to "The lexicographic sorting order of..."

Rationale: clarification, that's what it is.
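
To make the "byte-value" wording concrete, here is a small Python sketch (not part of the draft; the sample strings are arbitrary). It sorts the same strings once by character number and once by the octet values of their UTF-8 encodings, and the two orders come out the same.

```python
# Arbitrary sample strings covering 1-, 2-, 3- and 4-octet encodings.
samples = [
    "z",
    "\u00e9",
    "\uffff",
    "\U00010000",
    "a",
    "\u4e2d",
    "ab",
    "\U0010ffff",
]

# Python compares str values by character number (code point).
by_code_point = sorted(samples)

# Python compares bytes values octet by octet, i.e. by byte value.
by_utf8_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))
```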

============================================================================

Add Chris Newman to Acknowledgments.

Rationale: he had just slipped through the cracks.  With apologies.

============================================================================

-- 
François Yergeau
Received on Monday, 9 June 2003 16:14:20 UTC