Re: draft-yergeau-rfc2279bis-04.txt... from Kent Karlsson on 2003-03-10 (ietf-charsets@w3.org from January to March 2003)

From: Kent Karlsson <kentk@md.chalmers.se>
Date: Mon, 10 Mar 2003 15:40:20 +0100
To: ietf-charsets@iana.org
Message-id: <001801c2e712$faca6170$d5d61081@chalmers95a69n>
>From: Patrik Fältström <paf@cisco.com>
...
> If no-one have any issues with this, I hereby declare this done, and I 
> will take over from here.

Well, I have some comments on draft-yergeau-rfc2279bis-04.txt.
Except for a few places, I'm not suggesting wording here.
I leave that to the editor.


>   ISO/IEC 10646-1 defines a large character set called the Universal

Refer to (undated) ISO/IEC 10646 (not ISO/IEC 10646-1). 
Add after 10646: "and Unicode".

Reason:
A reference like ISO/IEC 10646 (usually) refers to the latest version,
since it is undated, and (usually) refers to all parts (if in parts).
(Note that the next edition of 10646 will be in just one part,
not two parts as now, and that the UCS repertoire is defined
by the union of the two current parts (and three amendments).)

>   implementers.  Up to the present time, changes in Unicode and
>   amendments and additions to ISO/IEC 10646 have tracked each other, so
>   that the character repertoires and code point assignments have
>   remained in sync.  

"sync" -> "synchrony" (proper English).  This is true only for the
last few years, and applies only to 10646 editions, and Unicode major
versions.  The amendments and minor versions of Unicode have a more
complex relationship. Suggested new text: "Unicode and ISO/IEC 10646
are kept synchronised, so that the character repertoires and code
point assignments are the same for similarly timed versions/editions
of these standards."

>   assigned to the character in ISO/IEC 10646 (the character number,

Called "code position" in 10646.  The term "character number" is wrong!
Code positions (also when represented in UTF-8) need not represent
characters (in some version of 10646/Unicode), but may be unassigned
(in some version), or for non-characters.  This does not mean that
such code positions are excluded from being encoded in UTF-8. (So
you need to reformulate a bit more than just correcting the term.)

>   o  The octet values C0, C1, FE and FF never appear. If the range of
>      character numbers is restricted to U+0000..U+10FFFF (the UTF-16
>      accessible range), then the octet values F5..FD also never appear.

Since this restriction is in force also for UTF-8, simply say that
C0, C1, F5-FF never occur in well-formed UTF-8 data.

In addition (in a separate point, for clarity), for well-formed
UTF-8, E0 will not be followed by a byte less than or equal to 9F,
ED will not be followed by a byte greater than or equal to A0, F0
will not be followed by a byte less than or equal to 8F, and F4 will
not be followed by a byte greater than or equal to 90 (hexadecimal).
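
As an illustration only, the two points above can be sketched as a
small Python check.  The names NEVER_OCCUR, SECOND_OCTET_RANGE and
find_forbidden are hypothetical, and the sketch tests only these two
rules; it is not a complete well-formedness check.

```python
# Octets that never occur anywhere in well-formed UTF-8: C0, C1, F5..FF.
NEVER_OCCUR = frozenset([0xC0, 0xC1]) | frozenset(range(0xF5, 0x100))

# Lead octet -> inclusive range allowed for the octet that follows it.
SECOND_OCTET_RANGE = {
    0xE0: (0xA0, 0xBF),  # excludes overlong 3-octet forms
    0xED: (0x80, 0x9F),  # excludes surrogate code points D800..DFFF
    0xF0: (0x90, 0xBF),  # excludes overlong 4-octet forms
    0xF4: (0x80, 0x8F),  # excludes values above 10FFFF
}


def find_forbidden(data: bytes):
    """Return (offset, reason) for the first violation, or None.

    Only the two rules listed above are checked; truncated sequences
    and stray continuation octets are not detected here.
    """
    for i, octet in enumerate(data):
        if octet in NEVER_OCCUR:
            return (i, "octet never occurs in well-formed UTF-8")
        if octet in SECOND_OCTET_RANGE and i + 1 < len(data):
            low, high = SECOND_OCTET_RANGE[octet]
            if not low <= data[i + 1] <= high:
                return (i + 1, "second octet out of range for lead octet")
    return None
```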

>   o  The lexicographic sorting order of UTF-8 strings is the same as if

Say "binary order" (nothing about sorting, and not unqualified).

>      ordered by character numbers.  Of course this is of limited

It may be of use for internal data representation purposes (like search
trees).
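
A minimal Python sketch of the ordering property, for illustration
only: comparing the UTF-8 octets gives the same order as comparing
code positions, which is not the case for UTF-16 code units.  The
function names are hypothetical.

```python
def sort_by_code_point(strings):
    # Python compares str values code point by code point.
    return sorted(strings)


def sort_by_utf8_octets(strings):
    # bytes values compare octet by octet (binary order).
    return sorted(strings, key=lambda s: s.encode("utf-8"))


def sort_by_utf16be_octets(strings):
    # Shown for contrast: surrogate code units (D800..DFFF) sort below
    # code units E000..FFFF although they encode larger code positions.
    return sorted(strings, key=lambda s: s.encode("utf-16-be"))
```

With these definitions, a key stored as UTF-8 octets in a search tree
is visited in code position order without decoding.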

>   UCS characters are designated by the U+HHHH notation, where HHHH is a
>   string of from 4 to 6 hexadecimal digits representing the character
>   number in ISO/IEC 10646.

...in shortest form (though no less than 4 digits).
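
For illustration, the notation "shortest form, but no less than 4
digits" corresponds to zero-padding to a minimum width of four.  The
helper name u_plus is hypothetical.

```python
def u_plus(code_position: int) -> str:
    """Format a code position in U+HHHH notation (4 to 6 hex digits)."""
    if not 0 <= code_position <= 0x10FFFF:
        raise ValueError("outside the range 0..10FFFF")
    # Width 4 is a minimum; larger values simply use 5 or 6 digits.
    return "U+{:04X}".format(code_position)
```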

>   formulae can also be found in Annex D of ISO/IEC 10646-1 [ISO.10646]

This reference should be dated, since the annex numbers/letters may
differ between editions of 10646 (or the annex reference number deleted).

>   Encoding a character to UTF-8 proceeds as follows:

UTF-8 is algorithmic, and does not depend on whether a given code
position is for a *character* or not.  However, the given code has to
be a Unicode scalar value (not a surrogate code point, nor above
10FFFF).  It is an error to try to encode a value that is not a Unicode
scalar value, and that should be explicitly pointed out.
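
A minimal Python sketch of an encoder along these lines, for
illustration only.  It accepts any Unicode scalar value, whether or
not a character is assigned to it, and treats anything else as an
error.  The name encode_scalar is hypothetical.

```python
def encode_scalar(value: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 octets."""
    if not 0 <= value <= 0x10FFFF:
        raise ValueError("not a scalar value: outside 0..10FFFF")
    if 0xD800 <= value <= 0xDFFF:
        raise ValueError("not a scalar value: surrogate code point")
    if value < 0x80:
        return bytes([value])
    if value < 0x800:
        return bytes([0xC0 | (value >> 6),
                      0x80 | (value & 0x3F)])
    if value < 0x10000:
        return bytes([0xE0 | (value >> 12),
                      0x80 | ((value >> 6) & 0x3F),
                      0x80 | (value & 0x3F)])
    return bytes([0xF0 | (value >> 18),
                  0x80 | ((value >> 12) & 0x3F),
                  0x80 | ((value >> 6) & 0x3F),
                  0x80 | (value & 0x3F)])
```

Note that a non-character such as FFFF is encoded like any other
scalar value; only surrogates and values above 10FFFF are rejected.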

>   encoding form (as surrogate pairs) and do not directly represent
>   characters. 

Please maintain the distinction between code positions (scalar values)
and characters.  Not all scalar values represent characters.  I agree
that it is unfortunate that Unicode and 10646 do not use exactly the
same terminology though.

>   Decoding a UTF-8 character proceeds as follows:

Please maintain distinctions between character, scalar value
(code position), and code unit sequence.  Decoding a UTF-8 code
unit sequence can lead to a *sequence* of code positions, some of
which may be for characters, some for non-characters, and some for
unassigned code positions.

>   Implementations of the decoding algorithm above MUST protect against
>   decoding invalid sequences.  For instance, a naive implementation may
>   decode the overlong UTF-8 sequence C0 80 into the character U+0000,
>   or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding
>   invalid sequences may have security consequences or cause other
>   problems.

This can be read as saying that it is OK to silently drop ill-formed
subsequences.  But that is definitely not OK, since it can lead to
security issues.
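
To illustrate the difference, here is a small Python sketch using the
built-in UTF-8 codec.  Its "strict" mode reports both of the draft's
example sequences as errors, while its "ignore" mode shows the silent
dropping that is being warned against.  The helper name rejects is
hypothetical.

```python
OVERLONG_NUL = bytes([0xC0, 0x80])
# UTF-8-style encoding of the surrogate pair D84C DFB4; the well-formed
# UTF-8 for U+233B4 is F0 A3 8E B4.
ENCODED_SURROGATES = bytes([0xED, 0xA1, 0x8C, 0xED, 0xBE, 0xB4])


def rejects(data: bytes) -> bool:
    """True if strict decoding reports the data as ill-formed."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return True
    return False


# Silent dropping: the ill-formed octets vanish and the two dots meet.
lossy = (b"." + OVERLONG_NUL + b".").decode("utf-8", errors="ignore")
```

Here lossy is the string "..", which a check performed on the
original octets would never have seen.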

>   UTF8-octets = *( UTF8-char )
>
>   UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4

As mentioned, this need not encode a character (so the "-char"
is misleading).

>   UTF8-1      = %x00-7F
>
>   UTF8-2      = %xC2-DF UTF8-tail
>
>   UTF8-3      = %xE0    %xA0-BF UTF8-tail / 
>                 %xE1-EC 2( UTF8-tail ) /
>                 %xED    %x80-9F UTF8-tail /
>                 %xEE-EF 2( UTF8-tail )
>
>   UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) /
>                 %xF1-F3 3( UTF8-tail ) /
>                 %xF4 %x80-8F 2( UTF8-tail )
>
>   UTF8-tail   = %x80-BF

(I've added some newlines and spaces above to make the text somewhat
easier to read.)
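
For illustration, the grammar as quoted above can be transcribed into
a Python regular expression over octets.  This is a sketch only; the
names TAIL, UTF8_1 .. UTF8_4 and is_well_formed_utf8 are hypothetical.

```python
import re

TAIL = rb"[\x80-\xBF]"

UTF8_1 = rb"[\x00-\x7F]"
UTF8_2 = rb"[\xC2-\xDF]" + TAIL
UTF8_3 = b"|".join([
    rb"\xE0[\xA0-\xBF]" + TAIL,
    rb"[\xE1-\xEC]" + TAIL + rb"{2}",
    rb"\xED[\x80-\x9F]" + TAIL,
    rb"[\xEE-\xEF]" + TAIL + rb"{2}",
])
UTF8_4 = b"|".join([
    rb"\xF0[\x90-\xBF]" + TAIL + rb"{2}",
    rb"[\xF1-\xF3]" + TAIL + rb"{3}",
    rb"\xF4[\x80-\x8F]" + TAIL + rb"{2}",
])

# UTF8-octets = *( UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4 )
UTF8_OCTETS = re.compile(
    b"(?:" + b"|".join([UTF8_1, UTF8_2, UTF8_3, UTF8_4]) + b")*"
)


def is_well_formed_utf8(data: bytes) -> bool:
    """True if the whole octet string matches the grammar."""
    return UTF8_OCTETS.fullmatch(data) is not None
```

The function says nothing about whether the decoded code positions
are assigned to characters, in line with the remark on "-char".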

>   can be used as a genuine "ZERO WIDTH NO-BREAK SPACE" within text, but

Preferably, WORD JOINER should be used as a zero-width no-break space,
not the character ZERO WIDTH NO-BREAK SPACE, despite the name!
This should be taken up early in this section.

>   initial character as a hint that the stream consists of UCS

The initial 3 *octets*, *not* the initial character.  It is actually
the octet sequence that acts as a signature, NOT the character so
encoded per se (decoding it to a character happens later, and the BOM
character cannot act as a "signature", since it is the same for all
encoding forms).
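
A short Python sketch of the point, for illustration only: the check
is made on the undecoded octets, and the same character U+FEFF yields
a different octet sequence in each encoding.  The function name
has_utf8_signature is hypothetical; the BOM constants are those of
Python's codecs module.

```python
import codecs


def has_utf8_signature(stream_start: bytes) -> bool:
    """True if the stream begins with the octets EF BB BF."""
    return stream_start.startswith(codecs.BOM_UTF8)


# One character, three different octet sequences; only the octets can
# tell the encodings apart.
SIGNATURES = {
    "utf-8": "\ufeff".encode("utf-8"),          # EF BB BF
    "utf-16-be": "\ufeff".encode("utf-16-be"),  # FE FF
    "utf-16-le": "\ufeff".encode("utf-16-le"),  # FF FE
}
```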

>   It is important to understand that the character U+FEFF appearing at
>   any position other than the beginning of a stream MUST be interpreted
>   with the semantics for the zero-width non-breaking space, and MUST
>   NOT be interpreted as a signature. When interpreted as a signature,
>   the Unicode standard suggests than an initial U+FEFF character may be
>   stripped before processing the text. Such stripping is necessary in
>   some cases (e.g. when concatenating two strings, because otherwise
>   the resulting string may contain an unintended "ZERO WIDTH NO-BREAK

Hmmm, part of the intent of using this particular character as a basis
for signatures is that a zero-width no-break space would not be very
disruptive even if kept; it has no visible width, it does not signify
any additional break point (so it works like NO BREAK HERE...).  And it
is a slightly smaller security issue if NOT deleted, even if at the
beginning of something (unclear what that something is; consider HTTP
or even more to the point SMTP/MIME (after the headings, and inside
some of the headings...)).

>   U+FEFF in the first position of a stream MAY be interpreted as a
>   zero-width non-breaking space, and is not always a signature. In an
>   attempt at diminishing this uncertainty, Unicode 3.2 adds a new
>   character, U+2060 "WORD JOINER", with exactly the same semantics and
>   usage as U+FEFF except for the signature function, and strongly
>   recommends its exclusive use for expressing word-joining semantics.
>   Eventually, following this recommendation will make it all but
>   certain that any initial U+FEFF is a signature, not an intended "ZERO
>   WIDTH NO-BREAK SPACE".

Move this to earlier in this section.

>   Implementers of UTF-8 need to consider the security aspects of how
>   they handle illegal UTF-8 sequences.  

"illegal" -> "ill-formed" (through-out); There is no point in having
THREE terminologies for the same things.  Can we please try to converge
to ONE terminology for these Unicode/10646 matters?  Similarly,
"byte" -> "octet" through-out.

It is not just ill-formed UTF-8 sequences that may lead to security
problems.  Here's a list (no claim of completeness):

 ILL-FORMED UTF-8 SUBSEQUENCES (that aren't part of well-formed
 UTF-8 subsequences):
  These MUST NOT be silently deleted by a decoder.
  The error indication for ill-formed subsequences
  should not be easy to ignore.
  Note that there are ill-formed subsequences
  that never have been properly decodable to code
  points, e.g. too short sequences, and isolated
  continuation octets.  Note that, e.g., ".<ill-fmd>." -> ".."
  may cause a security problem.
  (The editor is of course free to expand on this,
  I'm NOT suggesting wording here!)

 BOM:
  MUST NOT be deleted by a decoder. (See the next point.)

 NON-CHARACTERS, UNASSIGNED, DEFAULT_IGNORABLES incl. ZWNBSP:
  These MUST NOT be deleted by a decoder.  Deleting
  any character may, like replacing ill-formed sequences
  with the empty string of characters, lead to security
  problems.  Non-characters are not intended to be
  exchanged across "systems" (with an undefined meaning
  of "system"), but that does not imply that a recipient
  may delete them immediately when received, esp. when in
  a security sensitive context.

 NLF (new line function), SPACES, OTHER SYNTACTIC ELEMENTS:
  Note that security scanning may assume a different
  syntax than a subsequent interpreter. E.g. if the
  "scanner" only recognises SPACE and TAB as "whitespace",
  the division into elements may be quite different
  if a subsequent interpreter recognises also all
  category Zs characters as spaces.  This problem is not
  new to Unicode, since the problem with CR, LF, CR+LF,
  NEL, FF, and VT is present also for other character
  encodings, but Unicode adds PS (PARAGRAPH SEPARATOR)
  and LS (LINE SEPARATOR) to this set.

 NORMALISATION
  If a security scan does not handle normalisation
  in the same way as a subsequent interpreter, security
  problems may arise.  E.g. if "Kovert" is an access
  restricted folder and the source uses the KELVIN SIGN
  for the K, the security scan may miss that "Kovert"
  is accessed if the actual access uses any normalisation.

 CHECKSUMS and DIGITAL SIGNATURES
  Any change will corrupt the document signed (or just
  checksummed), so any kind of normalisation (e.g.)
  must be done before the signature (or checksum) is
  computed, otherwise the signature/checksum will be
  invalidated.  Similarly for the insertion/deletion
  of BOM, non-characters, etc.

 FALLBACKS
  Fallbacks should not be used within the UCS (but
  that is sometimes done anyway), but are commonly used
  when converting to another character encoding. Note that
  in either case, fallbacks may give rise to security problems.
  E.g., if <two-dot leader> is "felled back" to "..", or if 
  any unconvertible character is felled back to "?" (which in
  some circumstances is interpreted as matching any character).
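
Two of the points in the list above (NORMALISATION and NLF/SPACES)
can be illustrated with a small Python sketch.  The scanner and
interpreter functions are hypothetical stand-ins, and the choice of
NFC and of Python's own whitespace rules for the "interpreter" is an
assumption made for the example.

```python
import re
import unicodedata

RESTRICTED = "Kovert"


def scan_allows(path: str) -> bool:
    """Hypothetical scanner: compares code points, no normalisation."""
    return RESTRICTED not in path


def interpreter_path(path: str) -> str:
    """Hypothetical interpreter: normalises (NFC assumed) before use."""
    return unicodedata.normalize("NFC", path)


def scan_tokens(text: str):
    """Hypothetical scanner: only SPACE and TAB separate elements."""
    return [t for t in re.split(r"[ \t]+", text) if t]


def interpreter_tokens(text: str):
    """Hypothetical interpreter: any Unicode whitespace separates."""
    return text.split()


# U+212A KELVIN SIGN normalises to U+004B LATIN CAPITAL LETTER K.
request = "/home/\u212Aovert/file"
# U+2003 EM SPACE is a category Zs character.
command = "show\u2003secret"
```

With these definitions scan_allows(request) is true although
interpreter_path(request) contains "Kovert", and scan_tokens(command)
sees one element where interpreter_tokens(command) sees two.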

>   a parser which performs security-critical validity checks against the
>   UTF-8 encoded form of its input, but interprets certain illegal octet
>   sequences as characters.

"characters" -> "zero or more characters".  A UTF-8 decoder MUST NOT
silently remove ill-formed subsequences, but MAY nevertheless (still
together with an error indication) replace each with a code that
indicates that there may have been an encoding error (together
with another error indication). Note that U+001A SUBSTITUTE is a control
code which exists for that specific purpose (but for any encoding).
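
A minimal Python sketch of such a decoder, for illustration only:
each ill-formed subsequence reported by the strict built-in codec is
replaced by U+001A SUBSTITUTE, and its position is returned to the
caller as the error indication.  The function name and the shape of
the return value are assumptions, not anything the draft specifies.

```python
SUBSTITUTE = "\x1a"  # U+001A SUBSTITUTE


def decode_with_substitute(data: bytes):
    """Decode UTF-8, replacing each ill-formed subsequence.

    Returns (text, errors) where errors is a list of (start, end)
    octet offsets, so that the caller cannot miss that something
    was replaced.
    """
    pieces, errors, pos = [], [], 0
    while pos < len(data):
        try:
            pieces.append(data[pos:].decode("utf-8", errors="strict"))
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice decoded.
            pieces.append(data[pos:pos + exc.start].decode("utf-8"))
            pieces.append(SUBSTITUTE)
            errors.append((pos + exc.start, pos + exc.end))
            pos += exc.end
    return "".join(pieces), errors
```

Applied to the octets 2E C0 80 2E, the two dots stay separated in the
result and the error list is non-empty, so the ".." problem mentioned
earlier does not arise.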

>   o  Made Unicode the source of the normative definition of UTF-8,
>      keeping ISO/IEC 10646 as the reference for characters.

I fail to see the point in this forking.

>   [UNICODE]  The Unicode Consortium, "The Unicode Standard -- Version
>              3.2",  defined by The Unicode Standard, Version 3.0
>              (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5),
>              as amended by the Unicode Standard Annex #27: Unicode 3.1
>              (see http://www.unicode.org/reports/tr27) and by the
>              Unicode Standard Annex #28: Unicode 3.2 (see 
>              http://www.unicode.org/reports/tr28), March 2002, 
>              .

 Comma, period?  Refer also to UAX 13, Newline recommendations,
 UAX 15, Normalisation, and the Unicode character database.

>   [CESU-8]   Phipps, T., "Compatibility Encoding Scheme for UTF-16:
>              8-Bit (CESU-8)", UTR 26, April 2002, 
>              .

 Comma, period?


   /Kent Karlsson
Received on Monday, 31 March 2003 03:36:54 UTC