- From: Martin Duerst <duerst@w3.org>
- Date: Wed, 17 Apr 2002 18:40:23 +0900
- To: Francois Yergeau <FYergeau@alis.com>
- Cc: charsets <ietf-charsets@iana.org>
Hello Francois,

Many thanks for your very quick work! Here are my comments on
http://www.ietf.org/internet-drafts/draft-yergeau-rfc2279bis-00.txt.

- I prefer to get the .txt version rather than the .html version if you
  send one before publishing. For I-Ds, the .txt is the real thing.

<1> and some other places

   ISO/IEC 10646-1 defines a multi-octet character set called the
   Universal Character Set (UCS) which encompasses most of the world's
   writing systems.  Multi-octet characters, however, are not
   compatible with many current applications and protocols, and this
   has led to the development of UTF-8, the object of this memo.

While the title of ISO/IEC 10646 includes 'multi-octet', I think this
is confusing, because we want to clearly separate characters, their
numbers in the UCS, and the actual encoding into octets. I suggest you
remove 'multi-octet' everywhere except for the formal title in the
reference, and if necessary replace it with something like 'large'.

<13>

   o  The lexicographic sorting order of strings is preserved.  Of
      course this is of limited interest since a sort order based on
      character numbers is not culturally valid.

'preserved' with respect to what?

<14>

   o  The Boyer-Moore fast search algorithm can be used with UTF-8
      data.

This should be worded more generally, at least by inserting something
like 'and similar algorithms'.

<15>

   o  UTF-8 strings can be fairly reliably recognized as such by a
      simple algorithm, i.e. the probability that a string of
      characters in any other encoding appears as valid UTF-8 is low,
      diminishing with increasing string length.

This should maybe somehow mention the special case of an US-ASCII-only
string (which can be easily detected, but...).

<16>

   UTF-8 was originally a project of the X/Open Joint
   Internationalization Group XOJIG with the objective to specify a
   File System Safe UCS Transformation Format [FSS_UTF] that is
   compatible with UNIX systems, supporting multilingual text in a
   single encoding.
   The original authors were Gary Miller, Greger Leijonhufvud and John
   Entenmann.  Later, Ken Thompson and Rob Pike did significant work
   for the formal UTF-8.

formal UTF-8 -> formal definition of UTF-8 ?

<20>

   In UTF-8, characters are encoded using sequences of 1 to 6 octets.
   If the repertoire is restricted to the range U+0000 to U+10FFFF
   (the Unicode repertoire)

I don't like the term 'Unicode repertoire'. But I don't have a better
term for the moment, unfortunately.

<25>

   3.  Fill in the bits marked x from the bits of the character
       number, expressed in binary.  Start from the lower-order bits
       of the character number and put them first in the last octet of
       the sequence, then the next to last, etc. until all x bits are
       filled in.

This misses one important detail: the sequence in which the bits are
filled into a byte. This should be fixed. Maybe we can make things even
clearer, as follows:

   Character number                 | UTF-8 octet sequence
   (binary)                         | (binary)
   ---------------------------------+-------------------------------------
   0000000000000000000000000gfedcba | 0gfedcba
   000000000000000000000kjihgfedcba | 110kjihg 10fedcba
   0000000000000000ponmlkjihgfedcba | 1110ponm 10lkjihg 10fedcba
   00000000000utsrqponmlkjihgfedcba | 11110uts 10rqponm 10lkjihg 10fedcba
   000000zyxwvutsrqponmlkjihgfedcba | 111110zy 10xwvuts 10rqponm 10lkjihg
                                    | 10fedcba
   0EDCBAzyxwvutsrqponmlkjihgfedcba | 1111110E 10DCBAzy 10xwvuts 10rqponm
                                    | 10lkjihg 10fedcba

<32>

   ISO/IEC 10646 is updated from time to time by publication of
   amendments and additional parts; similarly, different versions of
   the Unicode standard are published over time.  Each new version
   obsoletes and replaces the previous one, but implementations, and
   more significantly data, are not updated instantly.

'different versions' gives the impression that these might be diverging
versions.

<33>

   In general, the changes amount to adding new characters, which does
   not pose particular problems with old data.
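Incidentally, the bit-layout table proposed under <25> is easy to check
with a few lines of code. The following Python sketch (the function name
and structure are mine, not from the draft) encodes a single character
number using exactly that procedure, filling continuation octets from
the low-order bits upward, for the full 1-to-6-octet forms:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one UCS character number as UTF-8 (1- to 6-octet forms)."""
    if cp < 0x80:
        return bytes([cp])                   # single octet, high bit clear
    # (continuation octets, first-octet marker, exclusive upper limit)
    for n_cont, marker, limit in (
        (1, 0xC0, 0x800),
        (2, 0xE0, 0x10000),
        (3, 0xF0, 0x200000),
        (4, 0xF8, 0x4000000),
        (5, 0xFC, 0x80000000),
    ):
        if cp < limit:
            octets = []
            for _ in range(n_cont):
                octets.append(0x80 | (cp & 0x3F))  # low 6 bits -> last octet first
                cp >>= 6
            octets.append(marker | cp)             # remaining high bits, marked
            return bytes(reversed(octets))
    raise ValueError("character number too large for UTF-8")
```

For instance, utf8_encode(0xD55C) yields ED 95 9C, matching the Hangul
example later in the draft.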
   Amendment 5 to ISO/IEC 10646, however, has moved and expanded the
   Korean Hangul block,

As far as I understand, amendments for ISO standards are numbered
separately for each version. So we need to clearly say here that it is
Amendment 5 to 10646:1993. Also, saying when that change happened
(Ken?) will help bring things into perspective for the new reader.

   thereby making any previous data containing Hangul characters
   invalid under the new version.  Unicode 2.0 has the same difference
   from Unicode 1.1.  The official justification for allowing such an
   incompatible change was that no implementations and no data
   containing Hangul existed, a statement that is likely to be true
   but remains unprovable.

As I personally had an implementation as well as some data (in ET++,
so this was also part of Lys), this is provably false. I propose to
change this to "The justification for allowing such an incompatible
change was that there were no major implementations and no significant
amounts of data containing Hangul."

<34>

   New versions, and in particular any incompatible changes, have
   consequences regarding MIME character encoding labels, to be
   discussed in section 5.

'character encoding' -> '"charset"' (I fight against the term
'character set' or 'charset' quite a bit, but here, it's the right
word to use, because that's the name of the parameter.)

'New versions have consequences' sounds a bit strange. What about: The
consequences of versioning on MIME "charset" labels, in particular in
the case of incompatible changes, are discussed in Section 5.

5. Byte order mark (BOM)

This section needs more work. The 'change log' says that it's mostly
taken from the UTF-16 RFC. But the BOM for UTF-8 is much less
necessary, and much more of a problem, than for UTF-16. We should
clearly say that with IETF protocols, character encodings are always
either labeled or fixed, and therefore the BOM SHOULD (and MUST at
least for small segments) never be used for UTF-8.
And we should clearly give the main argument, namely that it breaks
US-ASCII compatibility: US-ASCII encoded as UTF-8 without a BOM stays
exactly the same, but US-ASCII encoded as UTF-8 with a BOM is
different.

<35>

   The Unicode Standard and ISO 10646 define the character "ZERO WIDTH
   NO-BREAK SPACE" (U+FEFF), which is also known informally as "BYTE
   ORDER MARK" (abbreviated "BOM").  The latter name hints at a second
   possible usage of the character, in addition to its normal use as a
   genuine "ZERO WIDTH NO-BREAK SPACE" within text.  This usage,
   suggested by Unicode section 2.7 and ISO/IEC 10646 Annex H
   (informative), is to prepend a U+FEFF character to a stream of
   Unicode characters as a "signature"; a receiver of such a serialized

Unicode characters -> UCS characters ?

   stream may then use the initial character both as a hint that the
   stream consists of Unicode characters, as a way to recognize which
   UCS encoding is involved and, with encodings having a multi-octet
   encoding unit, as a way to recognize the serialization order of the
   octets.

The sentence that ends here is too long. Please split.

   UTF-8 having a single-octet encoding unit, this last function is
   useless and the BOM will always appear as the octet sequence EF BB
   BF.

<40>

   The character sequence representing the Hangul characters for the
   Korean word "hangugo" (U+D55C, U+AD6D, U+C5B4) is encoded in UTF-8
   as follows:

Please say that this word means Korean (language) in Korean. And it
should probably be spelled hangugeo.

<41>

   The character sequence representing the Han characters for the
   Japanese word "nihongo" (U+65E5, U+672C, U+8A9E) is encoded in
   UTF-8 as follows:

Please say that nihongo means Japanese (language).

<42>

   The character U+233B4 (a Chinese character meaning 'stump of
   tree'), prepended with a UTF-8 BOM, is encoded in UTF-8 as follows:

Please don't give an example of a bad practice.

<43>

   This memo is meant to serve as the basis for registration of a MIME
   character set parameter (charset) [RFC2978].
Obviously, UTF-8 is already registered. So I would reword this a bit,
maybe starting "This memo serves as the basis for the registration
of...". Then probably add an IANA Considerations section where you say
"Please update the reference for UTF-8 to point to this memo." or so.

8. Security Considerations

- Most of the attacks described have actually taken place. I think
  some 'might's and 'could's should be changed so that it's clearer
  that these are very realistic threats.

- It might be a good idea, here or somewhere else in the document, to
  provide some regular expressions that fully check UTF-8 byte
  sequences. Here is one from the W3C validator, in Perl (because Perl
  allows spaces, this is rather readable :-):

  s/ [\x00-\x7F]                        # ASCII
   | [\xC2-\xDF]        [\x80-\xBF]     # non-overlong 2-byte sequences
   | \xE0[\xA0-\xBF]    [\x80-\xBF]     # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte sequences
   | \xED[\x80-\x9F]    [\x80-\xBF]     # excluding surrogates
   | \xF0[\x90-\xBF]    [\x80-\xBF]{2}  # planes 1-3
   | [\xF1-\xF3]        [\x80-\xBF]{3}  # planes 4-15
   | \xF4[\x80-\x8F]    [\x80-\xBF]{2}  # plane 16
   //xg;

  (This substitutes all legal UTF-8 sequences away; if there is
  something left, it's not UTF-8.) This is for planes 0-16 only.

  Another is the ABNF from the usenet draft
  (http://www.ietf.org/internet-drafts/draft-ietf-usefor-article-06.txt):

  UTF8-xtra-2-head= %xC2-DF
  UTF8-xtra-3-head= %xE0 %xA0-BF / %xE1-EC %x80-BF /
                    %xED %x80-9F / %xEE-EF %x80-BF
  UTF8-xtra-4-head= %xF0 %x90-BF / %xF1-F7 %x80-BF
  UTF8-xtra-5-head= %xF8 %x88-BF / %xF9-FB %x80-BF
  UTF8-xtra-6-head= %xFC %x84-BF / %xFD %x80-BF
  UTF8-xtra-tail  = %x80-BF
  UTF8-xtra-char  = UTF8-xtra-2-head 1( UTF8-xtra-tail ) /
                    UTF8-xtra-3-head 1( UTF8-xtra-tail ) /
                    UTF8-xtra-4-head 2( UTF8-xtra-tail ) /
                    UTF8-xtra-5-head 3( UTF8-xtra-tail ) /
                    UTF8-xtra-6-head 4( UTF8-xtra-tail )

  This doesn't yet include US-ASCII, and it covers character numbers
  of up to 31 bits. Either of them probably needs a bit of work.
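As a cross-check on the Perl expression from the W3C validator, the
same pattern transliterates fairly directly into a Python bytes regexp.
This is my own sketch (names are mine) and, like the original, it
accepts planes 0-16 only, rejecting overlongs and surrogates:

```python
import re

# Bytes-level pattern for well-formed UTF-8 restricted to U+0000..U+10FFFF,
# transliterated from the W3C validator's Perl regular expression.
UTF8_SEQUENCE = re.compile(rb"""
    [\x00-\x7F]                         # US-ASCII
  | [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte sequences
  | \xE0[\xA0-\xBF][\x80-\xBF]          # 3-byte, excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte sequences
  | \xED[\x80-\x9F][\x80-\xBF]          # 3-byte, excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}       # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}           # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}       # plane 16
""", re.VERBOSE)

def is_utf8(data: bytes) -> bool:
    # Substitute every legal sequence away; valid input leaves nothing behind.
    return UTF8_SEQUENCE.sub(b"", data) == b""
```

So is_utf8(b"\xed\x95\x9c") is True, while the overlong b"\xc0\xaf" and
the surrogate b"\xed\xa0\x80" are both rejected.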
<59>

The encoding of your name and address, and Alain's and my name, is
messed up. Please don't try to smuggle something around the I-D
editor; it's not guaranteed to work.

Regards,    Martin.
Received on Wednesday, 17 April 2002 05:41:52 UTC