- From: Martin J. Duerst <duerst@w3.org>
- Date: Sun, 31 Jan 1999 17:40:13 +0900
- To: Paul Hoffman / IMC <phoffman@imc.org>
- Cc: ietf-charsets@iana.org
At 19:11 98/12/29 -0800, Paul Hoffman / IMC wrote:
> I didn't see the announcement on this list, but I could have missed it.
> draft-hoffman-utf16-01.txt is now available; a copy is at
> <http://www.imc.org/draft-hoffman-utf16>. Based on comments from many
> people, this draft has some significant changes throughout. Please let me
> know what you think.

<http://www.ietf.org/internet-drafts/draft-hoffman-utf16-01.txt>

Hello Paul,

Here are my comments. In general, I think the draft has been
significantly improved.

- Throughout the whole text, "UTF-16" has two meanings: one as a way
  to get from a sequence of numbers in a certain range to a sequence
  of 16-bit values and back, and the other as a "charset" value and
  the related character encoding. In this sense, even the title is
  not very clear. Please make it absolutely clear throughout the
  draft which of the two you mean every time "UTF-16" is mentioned.

- Abstract: There is no abstract. The abstract is what usually appears
  in the ID/RFC announcements. Is that a problem?

- 1. Intro: The first paragraph is crucial (it may be used as a
  replacement for the abstract).
  "This document specifies" -> "This document describes" (the
  specification is elsewhere, I guess).
  "UTF-16BE, UTF-16LE, and UTF-16" -> "UTF-16BE (big-endian), UTF-16LE
  (little-endian), and UTF-16" (it's very important to make sure
  people, in particular accidental readers, get the meaning of these
  acronyms early on!).

- 1.1 Background:
  - Reference RFC 2130 for the definitions of CCS, CES.
  - "UTF-16 ... is a character encoding scheme": This is jumping a bit
    too fast. If we say charset == CCS(s) + CES (which is an
    understanding that is widely used), there are two or three CES.

- 1.2 Motivation:
  - "UTF-8 imposes a space penalty for characters ... greater than
    0x800.": Not true. Correctly, it would say "UTF-8 imposes a space
    penalty for characters ... between 0x800 and 0xFFFF" (planes 1-16
    take 4 bytes in both cases).
  - "Also, characters in UTF-8 have varying sizes", "mostly uniform in
    size": There is some truth there, and indeed in some cases an
    advantage, but the way it's worded gives more the impression of
    contradiction and lack of clarity than anything else.
  - "Note, however, that UTF-8 has many other advantages...": You may
    cite my presentation on the properties of UTF-8 a few Unicode
    conferences ago. I don't have the reference handy here (I'm over
    Siberia now :-), but you can find it from my home page at
    http://www.w3.org/People/D%C3%BCrst.

- 2.1/2.2 Encoding/Decoding: The first sentence of each section should
  include "from ... to ...", because you are only dealing with one of
  two steps (and the one that is less important, because there is less
  danger of people getting it wrong).

- 3.1 Definition of big-endian and little-endian:
  - "two-octet entities" -> "16-bit entities".
  - Please insert something like "On the other hand" before the sixth
    line of the first paragraph, to make sure the reader understands
    that there is a slight context switch.
  - "Most modern hardware": This has been discussed here and elsewhere
    more than enough, but the above wording is unclear, has a high
    potential for not being true (depending on the interpretation of
    "modern" and "hardware"), and also has the potential to look
    rather strange in a few years. I would suggest changing it to
    "Most current PC hardware". With this, you are on the safe side,
    and the reader gets the right message.
  - Maybe the example 0x0102 can be improved by choosing a different
    number that doesn't lead to trivial byte values?
  - "It is important ... that the beginning of a stream ...": Does the
    MIME spec speak about streams? If not, I would propose to avoid
    that word.
  - "The contrapositive ... is not always true ...": In general, you
    are right. But this spec should contain rules for when a 0xFEFF is
    a BOM, and when it's a ZWNBSP, otherwise we are not interoperable,
    and this sentence should contain a pointer to these rules.
  - "Also, some specifications mandate an initial 0xFEFF character ...
    and specify that this signature is not to be removed.": All this
    stuff is blurring the architectural boundary between the character
    layer and the application layer. This is a bad idea.
  - The text here should clearly say that the use of a BOM at the
    beginning may result in desynchronization of various parts of the
    system. For mail, that's maybe not so much of a problem, but it is
    more of one for other systems (e.g. HTTP) that are also based on
    MIME.

- 3.2/3.3 Serialization in UTF-16BE and UTF-16LE:
  You treat the BOM only on the receiver side. What are senders
  supposed to do, put one in or not? My strong suggestion is to say
  that they MUST NOT put one in. Note that for BE and LE, we can still
  define things the way we think is best, and we should use that
  chance.

- 3.4 Serialization in UTF-16:
  "All applications that process text in the "UTF-16" charset MUST be
  able to read at least the first two octets ...": This sounds like it
  is permitted for a recipient to read the first two bytes, decide
  e.g. that it is LE, and then stop because LE is not supported. Is
  this what you intend to say? Or what?

4. Choosing a charset
  - The MUST to use BE/LE if the encoding is known makes a lot of
    sense in general. And the "and knows the serialization of the
    characters in the text" is superfluous, because given the rules
    for encoding of "UTF-16", it's rather impossible to do anything
    without knowing that. However, given that some software already
    accepts "UTF-16" but doesn't know "UTF-16BE" and "UTF-16LE",
    simply using a MUST may not be appropriate.
  - "should be sure the text is labelled" -> "should make sure ..."
  - "Text-creating programs": This assumes a simple model of mail
    software with some built-in or connected text editor. This should
    be reworded to apply to a wider range of models.

5. Examples
  - I suggest you remove (or label as inappropriate) the example of
    UTF-16BE with a BOM, and put in examples of big-endian with and
    without a BOM for "UTF-16".

6. Versions
  - "Each new version obsoletes and replaces": I think the word
    "obsoletes" is much too strong here.
  - "no implementations and no data ... likely to be true": This is
    definitely false! I had (and still have somewhere) an
    implementation of the old Hangul encoding and some data such as
    keyboard tables and test data.
  - There is some overlap between "6. Versions" and the appendix. I
    suggest that you have a look at it again, decide what needs to go
    where, and clearly separate things.

7. Security considerations
  - "surrogate pairs that have illegal values": This is not the only
    case of illegal values; a single surrogate is also illegal.

- I'm still working on checking the conversion code (architecture-
  independent writeout of BE or LE); I'll come back to that as soon as
  possible. I think it's very important that we provide this to help
  people write more stable code.

Hope this helps. Looking forward to your questions and comments,

Regards, Martin.

#-#-# Martin J. Du"rst, World Wide Web Consortium
#-#-# mailto:duerst@w3.org http://www.w3.org
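
Editorial illustration (not part of the original message, and not the draft's conversion code): the two steps discussed above, going from a number to 16-bit values and then from 16-bit values to octets, can be sketched roughly as follows. This is a minimal Python sketch; the function names `to_utf16_units` and `serialize` are hypothetical and chosen only for this example. The octets are produced with shifts and masks, so the result does not depend on the byte order of the machine running the code. The value 0x1234 is used because, unlike 0x0102, it does not lead to trivial byte values.

```python
# Illustrative sketch only; the names below are hypothetical, not from the draft.

def to_utf16_units(cp):
    """Step 1: map one character number to one or two 16-bit values."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("value is outside the range UTF-16 can represent")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("a single surrogate is not a legal value")
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000
    # The high surrogate carries the top 10 bits, the low surrogate the rest.
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


def serialize(units, big_endian):
    """Step 2: write 16-bit values as octets, independent of host byte order."""
    out = bytearray()
    for u in units:
        high, low = (u >> 8) & 0xFF, u & 0xFF
        out.extend((high, low) if big_endian else (low, high))
    return bytes(out)


units = to_utf16_units(0x1234)
print(serialize(units, True).hex())    # 1234 (big-endian order)
print(serialize(units, False).hex())   # 3412 (little-endian order)
print(serialize(to_utf16_units(0x1D11E), True).hex())  # d834dd1e (surrogate pair)
```

Note that the sketch writes no BOM at all; whether a sender should emit one is exactly the question raised for sections 3.2 to 3.4 and is left open here.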
Received on Monday, 1 February 1999 04:47:13 UTC