- From: Kent Karlsson <kentk@md.chalmers.se>
- Date: Mon, 10 Mar 2003 15:56:41 +0100
- To: <ietf-charsets@w3.org>
>From: Patrik Fältström <paf@cisco.com> ... > If no-one have any issues with this, I hereby declare this done, and I > will take over from here. Well, I have some comments on draft-yergeau-rfc2279bis-04.txt. Except for a few places, I'm not suggesting wording here. I leave that to the editor. > ISO/IEC 10646-1 defines a large character set called the Universal Refer to (undated) ISO/IEC 10646 (not ISO/IEC 10646-1). Add after 10646: "and Unicode". Reason: A reference like ISO/IEC 10646 (usually) refers to the latest version, since it is undated, and (usually) refers to all parts (if in parts). (Note that the next edition of 10646 will be in just one part, not two parts as now, and that the UCS repertoire is defined by the union of the two current parts (and three amendments).) > implementers. Up to the present time, changes in Unicode and > amendments and additions to ISO/IEC 10646 have tracked each other, so > that the character repertoires and code point assignments have > remained in sync. "sync" -> "synchrony" (proper English). This is true only for the last few years, and applies only to 10646 editions, and Unicode major versions. The amendments and minor versions of Unicode have a more complex relationship. Suggested new text: "Unicode and ISO/IEC 10646 are kept synchronised, so that the character repertoires and code point assignments are the same for similarly timed versions/editions of these standards." > assigned to the character in ISO/IEC 10646 (the character number, Called "code position" in 10646. The term "character number" is wrong! Code positions (also when represented in UTF-8) need not represent characters (in some version of 10646/Unicode), but may be unassigned (in some version), or for non-characters. This does not mean that such code positions are excluded from being encoded in UTF-8. (So you need to reformulate a bit more than just correcting the term.) > o The octet values C0, C1, FE and FF never appear. If the range of > character numbers is restricted to U+0000..U+10FFFF (the UTF-16 > accessible range), then the octet values F5..FD also never appear. Since this restriction is in force also for UTF-8, simply say that C0, C1, F5-FF never occur in well-formed UTF-8 data. In addition (in a separate point, for clarity), for well-formed UTF-8, E0 will not be followed by a byte less than or equal to 9F, ED will not be followed by a byte greater than or equal to A0, F0 will not be followed by a byte less than or equal to 8F, and F4 will not be followed by a byte greater than or equal to 90 (hexadecimal). > o The lexicographic sorting order of UTF-8 strings is the same as if Say "binary order" (nothing about sorting, and not unqualified). > ordered by character numbers. Of course this is of limited It may be of use for internal data representation purposes (like search trees). > UCS characters are designated by the U+HHHH notation, where HHHH is a > string of from 4 to 6 hexadecimal digits representing the character > number in ISO/IEC 10646. ...in shortest form (though no less than 4 digits). > formulae can also be found in Annex D of ISO/IEC 10646-1 [ISO.10646] This reference should be dated, since the annex numbers/letters may differ between editions of 10646 (or the annex reference number deleted). > Encoding a character to UTF-8 proceeds as follows: UTF-8 is algorithmic, and does not depend on whether a given code position is for a *character* or not. However the given code has to be a Unicode scalar value (not a surrogate code point, nor above 10FFFF). But it is an error to try to encode a value that is not a Unicode scalar value. The latter should be explicitly pointed out. > encoding form (as surrogate pairs) and do not directly represent > characters. Please maintain the distinction between code positions (scalar values) and characters. Not all scalar values represent characters. I agree that it is unfortunate that Unicode and 10646 do not use exactly the same terminology though. > Decoding a UTF-8 character proceeds as follows: Please maintain distinctions between character, scalar value (code position), and code unit sequence. Decoding a UTF-8 code unit sequence can lead to a *sequence* of code positions, some may be for characters, some for non-characters, and some for unassigned code positions. > Implementations of the decoding algorithm above MUST protect against > decoding invalid sequences. For instance, a naive implementation may > decode the overlong UTF-8 sequence C0 80 into the character U+0000, > or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding > invalid sequences may have security consequences or cause other > problems. This can be read as it is ok to silently drop ill-formed subsequences. But that is definitely not ok, since that can lead to security issues. > UTF8-octets = *( UTF8-char ) > > UTF8-char = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4 As mentioned, this need not encode a character (so the "-char" is misleading). > UTF8-1 = %x00-7F > > UTF8-2 = %xC2-DF UTF8-tail > > UTF8-3 = %xE0 %xA0-BF UTF8-tail / > %xE1-EC 2( UTF8-tail ) / > %xED %x80-9F UTF8-tail / > %xEE-EF 2( UTF8-tail ) > > UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / > %xF1-F3 3( UTF8-tail ) / > %xF4 %x80-8F 2( UTF8-tail ) > > UTF8-tail = %x80-BF (I've added some newlines and spaces above to make the text somewhat easier to read.) > can be used as a genuine "ZERO WIDTH NO-BREAK SPACE" within text, but Preferably, WORD JOINER should be used as a zero-width no-break space, not the character ZERO WIDTH NO-BREAK SPACE, despite the name! This should be taken up early in this section. > initial character as a hint that the stream consists of UCS The initial 3 *octets*, *not* the initial character. It actually the octet sequence that acts as a signature, NOT the character so encoded per se (decoding it to a character happens later, and the BOM character cannot act as "signature", it's the same for all encoding forms). > It is important to understand that the character U+FEFF appearing at > any position other than the beginning of a stream MUST be interpreted > with the semantics for the zero-width non-breaking space, and MUST > NOT be interpreted as a signature. When interpreted as a signature, > the Unicode standard suggests than an initial U+FEFF character may be > stripped before processing the text. Such stripping is necessary in > some cases (e.g. when concatenating two strings, because otherwise > the resulting string may contain an unintended "ZERO WIDTH NO-BREAK Hmmm, part of the intent of using this particular character as a basis for signatures, is that a zero-width no-break space would not be very disruptive even if kept; it has no visible width, it does not signify any additional break point (so it works like NO BREAK HERE...). And it is a slightly smaller security issue if NOT deleted, even if at the beginning of something (unclear what that something is; consider HTTP or even more to the point SMTP/MIME (after the headings, and inside some of the headings...)). > U+FEFF in the first position of a stream MAY be interpreted as a > zero-width non-breaking space, and is not always a signature. In an > attempt at diminishing this uncertainty, Unicode 3.2 adds a new > character, U+2060 "WORD JOINER", with exactly the same semantics and > usage as U+FEFF except for the signature function, and strongly > recommends its exclusive use for expressing word-joining semantics. > Eventually, following this recommendation will make it all but > certain that any initial U+FEFF is a signature, not an intended "ZERO > WIDTH NO-BREAK SPACE". Move this to earlier in this section. > Implementers of UTF-8 need to consider the security aspects of how > they handle illegal UTF-8 sequences. "illegal" -> "ill-formed" (through-out); There is no point in having THREE terminologies for the same things. Can we please try to converge to ONE terminology for these Unicode/10646 matters. Similarly, "byte" -> "octet" through-out. It is not just ill-formed UTF-8 sequences that may lead to security problems. Here's a list (no claim of completeness): ILL-FORMED UTF-8 SUBSEQUENCES (that aren't part of well-formed UTF-8 subsequences): These MUST NOT be silently deleted by a decoder. The error indication for ill-formed subsequences should not be easy to ignore. Note that there are ill-formed subsequences that never have been properly decodable to code points, e.g. too short sequences, and isolated continuation octets. Note that, e.g., ".<ill-fmd>." -> ".." may cause a security problem. (The editor is of course free to expand on this, I'm NOT suggesting wording here!) BOM: MUST NOT be deleted by a decoder. (See the next point.) NON-CHARACTERS, UNASSIGNED, DEFAULT_IGNORABLES incl. ZWNBSP: These MUST NOT be deleted by a decoder. Deleting any character may, like replacing ill-formed sequenced with the empty string of characters, lead to security problems. Non-characters are not intended to be exchanged across "systems" (with an undefined meaning of "system"), but that does not imply that a recipient may delete them immediately when received, esp. when in a security sensitive context. NLF (new line function), SPACES, OTHER SYNTACTIC ELEMENTS: Note that security scanning may assume a different syntax than a subsequent interpreter. E.g. if the "scanner" only recognises SPACE and TAB as "whitespace", the division into elements may be quite different if a subsequent interpreter recognises also all category Zs characters as spaces. This problem is not new to Unicode, since the problem with CR, LF, CR+LF, NEL, FF, and VT is present also for other character encodings, but Unicode adds PS (PARAGRAPH SEPARATOR) and LS (LINE SEPARATOR) to this set. NORMALISATION If a security scan does not handle normalisation in the same way as a subsequent interpreter, security problems may arise. E.g. if "Kovert" is an access restricted folder, if the source uses the KELVIN SIGN for the K, the security scan may miss that "Kovert" is accessed if the actual access uses any normalisation. CHECKSUMS and DIGITAL SIGNATURES Any change will corrupt the document signed (or just checksummed), so any kind of normalisation (e.g.) must be done before the signature (or checksum) is computed, otherwise the signature/checksum will be invalidated. Similarly for the insertion/deletion of BOM, non-characters, etc. FALLBACKS Fallbacks should not be used within the UCS (but that is sometimes done anyway), but is commonly used when converting to another character encoding. Note that in either case, fallbacks may give rise to security problems. E.g., if <two-dot leader> is "felled back" to "..", or if any unconvertable character is felled back to "?" (which in some circumstances is interpreted as matching any character). > a parser which performs security-critical validity checks against the > UTF-8 encoded form of its input, but interprets certain illegal octet > sequences as characters. "characters" -> "zero or more characters". A UTF-8 decoder MUST NOT silently remove ill-formed subsequences, but MAY nevertheless (still together with an error indication) replace each with a code that indicates that there may have been an encoding error (together with another error indication). Note that U+001A SUBSTITUTE is a control code which exist for that specific purpose (but for any encoding). > o Made Unicode the source of the normative definition of UTF-8, > keeping ISO/IEC 10646 as the reference for characters. I fail to see the point in this forking. > [UNICODE] The Unicode Consortium, "The Unicode Standard -- Version > 3.2", defined by The Unicode Standard, Version 3.0 > (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5), > as amended by the Unicode Standard Annex #27: Unicode 3.1 > (see http://www.unicode.org/reports/tr27) and by the > Unicode Standard Annex #28: Unicode 3.2 (see > http://www.unicode.org/reports/tr28), March 2002, > . Comma, period? Refer also to UAX 13, Newline recommendations, UAX 15, Normalisation, and the Unicode character database. > [CESU-8] Phipps, T., "Compatibility Encoding Scheme for UTF-16: > 8-Bit (CESU-8)", UTR 26, April 2002, > . Comma, period? /Kent Karlsson
Received on Monday, 10 March 2003 09:59:11 UTC