Re: draft-hoffman-utf16-01.txt available

At 19:11 98/12/29 -0800, Paul Hoffman / IMC wrote:
> I didn't see the announcement on this list, but I could have missed it.
> draft-hoffman-utf16-01.txt is now available; a copy is at
> <http://www.imc.org/draft-hoffman-utf16>. Based on comments from many
> people, this draft has some significant changes throughout. Please let me
> know what you think.

<http://www.ietf.org/internet-drafts/draft-hoffman-utf16-01.txt>


Hello Paul,

Here are my comments. In general, I think the draft has been
siginficantly improved.


- Throughout the whole text, "UTF-16" has two meanings, one
  as a way to get from a sequence of numbers in a certain range
  to a sequence of 16-bit values and back, and the other as
  a "charset" value and the related character encoding.
  In this sense, already the title is not very clear. Please
  make it absolutely clear thoughout the draft of which thing
  you speak every time "UTF-16" is mentionned.

- Abstract: There is no abstract. The abstract is what usually
  appears in the ID/RFC anouncements. Is that a problem?

- 1. Intro: The first paragraph is crucial (it may be used as a
  replacement for the abstract).
  "This document specifies" -> "This document describes"
  (the specification is elsewhere, I guess).
  "UTF-16BE, UTF-16LE, and UTF-16" -> "UTF-16BE (big-endian),
  UTF-16LE (little-endian), and UTF-16" (it's very important
  to make sure people, in particular accidental readers, get
  the meaning of these acronyms early on!

- 1.1 Background:
  - Reference RFC 2130 for the definitions of CCS, CES.
  - "UTF-16 ... is a character encoding scheme": This is
    jumping a bit too fast. If we say charset == CCS(s) + CES
    (which is an understanding that is widely used), there
    are two or three CES.

- 1.2 Motivation:
  - "UTF-8 imposes a space penalty for characters ... greater
    than 0x800.": Not true. Correctly, it would say "UTF-8
    imposes a space penalty for characters ... between
    0x800 and 0xFFFF" (planes 1-16 take 4 bytes in both cases).
  - "Also, characters in UTF-8 have varying sizes", "mostly
    uniform in size": There is some truth there, and indeed
    in some cases an advantage, but the way it's worded gives
    more the impression of contradiction and unclearness than
    anything else.
  - "Note, however, that UTF-8 has many other advantages...":
    You may cite my presentation on the properties of UTF-8
    a few Unicode conferences ago. I don't have the reference
    handy here (I'm over Siberia now :-), but you can find it
    from my home page at http://www.w3.org/People/D%C3%BCrst.


- 2.1/2.2 Encoding/Decoding: The first sentence of each section
  should include "from ... to ...", because you are only dealing
  with one of two steps (and the one that is less important,
  because there is less danger of people getting it wrong).

- 3.1 Definition of big-endian and little-endian.
  - "two-octet entities" -> "16-bit entities".
  - Please insert something like "On the other hand" before
    the sixth line of the first paragraph, to make sure the
    reader understands that there is a slight context switch.
  - "Most modern hardware": This has been discussed here and
    elsewhere more than enough, but the above wording is
    unclear, has a high potential for not being true (depending
    on the interpretation of "modern" and "hardware"), and
    also has a potential to look quite a bit strange in a few
    years. I would suggets to change it to "Most current PC hardware".
    With this, you are on the safe side, and the reader gets the
    right message.

  - Maybe the example 0x0102 can be improved by choosing a different
    number that doesn't lead to trivial byte values?

  - "It is important ... that the beginnig of a stream ...":
    Does the MIME spec speak about streams? If not, I would
    propose to avoid that word.

  - "The contrapositive ... is not always true ...": In general,
    you are right. But this spec should contain rules for when a
    0xFFFE is a BOM, and when it's a ZWNBSP, otherwise we are not
    interoperable, and this sentence should contain a pointer to
    these rules.

  - "Also, some specifications mandate an initial 0xFFFE character
    ... and specify that this signature is not to be removed.":
    All this stuff is blurring the architectural boundary between
    the character layer and the application layer. This is a bad
    idea.

  - The text here should clearly say that the use of a BOM at the
    beginning may result in desinchronization of various parts of
    the system. For mail, that's maybe not so much of a problem,
    but more for other systems (e.g. HTTP) that are also based on
    MIME.

- 3.2/3.3 Serialization in UTF-16BE and UTF-16LE
  You treat the BOM only on the receiver side. What are senders
  supposed to do, put one in or not? My strong suggestion is to
  say that the MUST NOT put one in. Note that for BE and LE, we
  can still define things the way we think it's best, and we
  should use that chance.

- 3.4 Serialization in UTF-16
  "All applications that process text in the "UTF-16" charset
  MUST be able to read at least the first two octets ...": This
  sounds like it is permitted for a recipient to read the first
  two bytes, decide e.g. that it is LE, and then stop because LE
  is not supported. Is this what you intend to say? Or what?

4. Choosing a charset
  - The MUST to use BE/LE if the encoding is known makes very
    much sense in general. And the "and knows the serialization
    of the characters in the text" is superfluous, because given
    the rules for encoding of "UTF-16", it's rather impossible
    to do anything without knowing that.
    However, given that some software already accepts "UTF-16",
    but doesn't know "UTF-16BE" and "UTF-16LE", just simply using
    a MUST may not be appropriate.

  - "should be sure the text is labelled" -> "should make sure ..."

  - "Text-creating programs": This assumes a simple model of a
    mail software with some built-in or connected text editor.
    This should be reworded to apply to a wider range of models.

5. Examples
   - I suggest you remove (or label as inappropriate) the example
     of UTF-16BE with a BOM, and put in examples of big-endian
     with and without a BOM for "UTF-16".

6. Versions
   - "Each new version obsoletes and replaces": I think the
     word "obsoletes" is much too strong here.

   - "no implementations and no data ... likely to be true":
     This is definetly false! I had (and still have somewhere)
     an implementation of the old Hangul encoding and some data
     such as keyboard tables and test data.

   - There is some overlap between "6. Versions" and the appendix.
     I suggest that you have a look at it again, decides what
     needs to go where, and clearly separate things.

7. Security considerations

   - "surrogate pairs that have illegal values": This is not
     the only case of illegal values, a single surrogate is
     also illegal.


- I'm still working on checking the conversion code (architecture-
  independent writeout of BE or LE); I'll come back to that as soon
  as possible. I think it's very important that we provide this
  to help people write more stable code.

Hope this helps. Looking forwards to your questions
and comments,


Regards,   Martin.


#-#-#  Martin J. Du"rst, World Wide Web Consortium
#-#-#  mailto:duerst@w3.org   http://www.w3.org

Received on Monday, 1 February 1999 04:47:13 UTC