- From: Martin Duerst <duerst@w3.org>
- Date: Wed, 17 Apr 2002 18:40:23 +0900
- To: Francois Yergeau <FYergeau@alis.com>
- Cc: charsets <ietf-charsets@iana.org>
Hello Francois,

Many thanks for your very quick work! Here are my comments on
http://www.ietf.org/internet-drafts/draft-yergeau-rfc2279bis-00.txt.

- I prefer to get the .txt version rather than the .html version if you
  send one before publishing. For I-Ds, the .txt is the real thing.

<1> and some other places

   ISO/IEC 10646-1 defines a multi-octet character set called the
   Universal Character Set (UCS) which encompasses most of the world's
   writing systems.  Multi-octet characters, however, are not
   compatible with many current applications and protocols, and this
   has led to the development of UTF-8, the object of this memo.

While the title of ISO/IEC 10646 includes 'multi-octet', I think this
is confusing, because we want to clearly separate characters, their
numbers in the UCS, and the actual encoding into octets. I suggest you
remove 'multi-octet' everywhere except for the formal title in the
reference, and if necessary replace it with something like 'large'.

<13>

   o  The lexicographic sorting order of strings is preserved.  Of
      course this is of limited interest since a sort order based on
      character numbers is not culturally valid.

'preserved' with respect to what?

<14>

   o  The Boyer-Moore fast search algorithm can be used with UTF-8
      data.

This should be worded more generally, at least by inserting something
like 'and similar algorithms'.

<15>

   o  UTF-8 strings can be fairly reliably recognized as such by a
      simple algorithm, i.e. the probability that a string of
      characters in any other encoding appears as valid UTF-8 is low,
      diminishing with increasing string length.

This should maybe somehow mention the special case of an US-ASCII-only
string (which can be easily detected, but...).

<16>

   UTF-8 was originally a project of the X/Open Joint
   Internationalization Group XOJIG with the objective to specify a
   File System Safe UCS Transformation Format [FSS_UTF] that is
   compatible with UNIX systems, supporting multilingual text in a
   single encoding.
   The original authors were Gary Miller, Greger Leijonhufvud and John
   Entenmann.  Later, Ken Thompson and Rob Pike did significant work
   for the formal UTF-8.

formal UTF-8 -> formal definition of UTF-8 ?

<20>

   In UTF-8, characters are encoded using sequences of 1 to 6 octets.
   If the repertoire is restricted to the range U+0000 to U+10FFFF
   (the Unicode repertoire)

I don't like the term 'Unicode repertoire'. But I don't have a better
term for the moment, unfortunately.

<25>

   3.  Fill in the bits marked x from the bits of the character
       number, expressed in binary.  Start from the lower-order bits
       of the character number and put them first in the last octet of
       the sequence, then the next to last, etc. until all x bits are
       filled in.

This misses one important detail: the sequence in which the bits are
filled into a byte. This should be fixed. Maybe we can make things even
clearer, as follows:

   Character number                 | UTF-8 octet sequence
   (binary)                         | (binary)
   ---------------------------------+-------------------------------------
   0000000000000000000000000gfedcba | 0gfedcba
   000000000000000000000kjihgfedcba | 110kjihg 10fedcba
   0000000000000000ponmlkjihgfedcba | 1110ponm 10lkjihg 10fedcba
   00000000000utsrqponmlkjihgfedcba | 11110uts 10rqponm 10lkjihg 10fedcba
   000000zyxwvutsrqponmlkjihgfedcba | 111110zy 10xwvuts 10rqponm 10lkjihg
                                    | 10fedcba
   0EDCBAzyxwvutsrqponmlkjihgfedcba | 1111110E 10DCBAzy 10xwvuts 10rqponm
                                    | 10lkjihg 10fedcba

<32>

   ISO/IEC 10646 is updated from time to time by publication of
   amendments and additional parts; similarly, different versions of
   the Unicode standard are published over time.  Each new version
   obsoletes and replaces the previous one, but implementations, and
   more significantly data, are not updated instantly.

'different versions' gives the impression that these might be diverging
versions.

<33>

   In general, the changes amount to adding new characters, which does
   not pose particular problems with old data.
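Incidentally, the bit-layout table proposed under <25> is easy to check
with a few lines of code. The following Python sketch (the function name
and structure are mine, not from the draft) encodes a single character
number using exactly that procedure, filling continuation octets from
the low-order bits upward, for the full 1-to-6-octet forms:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one UCS character number as UTF-8 (1- to 6-octet forms)."""
    if cp < 0x80:
        return bytes([cp])                   # single octet, high bit clear
    # (continuation octets, first-octet marker, exclusive upper limit)
    for n_cont, marker, limit in (
        (1, 0xC0, 0x800),
        (2, 0xE0, 0x10000),
        (3, 0xF0, 0x200000),
        (4, 0xF8, 0x4000000),
        (5, 0xFC, 0x80000000),
    ):
        if cp < limit:
            octets = []
            for _ in range(n_cont):
                octets.append(0x80 | (cp & 0x3F))  # low 6 bits -> last octet first
                cp >>= 6
            octets.append(marker | cp)             # remaining high bits, marked
            return bytes(reversed(octets))
    raise ValueError("character number too large for UTF-8")
```

For instance, utf8_encode(0xD55C) yields ED 95 9C, matching the Hangul
example later in the draft.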
   Amendment 5 to ISO/IEC 10646, however, has moved and expanded the
   Korean Hangul block,

As far as I understand, amendments for ISO standards are numbered
separately for each version. So we need to clearly say here that it is
Amendment 5 to 10646:1993. Also, saying when that change happened
(Ken?) will help bring things into perspective for the new reader.

   thereby making any previous data containing Hangul characters
   invalid under the new version.  Unicode 2.0 has the same difference
   from Unicode 1.1.  The official justification for allowing such an
   incompatible change was that no implementations and no data
   containing Hangul existed, a statement that is likely to be true
   but remains unprovable.

As I personally had an implementation as well as some data (in ET++,
so this was also part of Lys), this is provably false. I propose to
change this to "The justification for allowing such an incompatible
change was that there were no major implementations and no significant
amounts of data containing Hangul."

<34>

   New versions, and in particular any incompatible changes, have
   consequences regarding MIME character encoding labels, to be
   discussed in section 5.

'character encoding' -> '"charset"' (I fight against the term
'character set' or 'charset' quite a bit, but here, it's the right
word to use, because that's the name of the parameter.)

'New versions have consequences' sounds a bit strange. What about: The
consequences of versioning on MIME "charset" labels, in particular in
the case of incompatible changes, are discussed in Section 5.

5. Byte order mark (BOM)

This section needs more work. The 'change log' says that it's mostly
taken from the UTF-16 RFC. But the BOM for UTF-8 is much less
necessary, and much more of a problem, than for UTF-16. We should
clearly say that with IETF protocols, character encodings are always
either labeled or fixed, and therefore the BOM SHOULD (and MUST at
least for small segments) never be used for UTF-8.
And we should clearly give the main argument, namely that it breaks
US-ASCII compatibility: US-ASCII encoded as UTF-8 without a BOM stays
exactly the same, but US-ASCII encoded as UTF-8 with a BOM is
different.

<35>

   The Unicode Standard and ISO 10646 define the character "ZERO WIDTH
   NO-BREAK SPACE" (U+FEFF), which is also known informally as "BYTE
   ORDER MARK" (abbreviated "BOM").  The latter name hints at a second
   possible usage of the character, in addition to its normal use as a
   genuine "ZERO WIDTH NO-BREAK SPACE" within text.  This usage,
   suggested by Unicode section 2.7 and ISO/IEC 10646 Annex H
   (informative), is to prepend a U+FEFF character to a stream of
   Unicode characters as a "signature"; a receiver of such a serialized

Unicode characters -> UCS characters ?

   stream may then use the initial character both as a hint that the
   stream consists of Unicode characters, as a way to recognize which
   UCS encoding is involved and, with encodings having a multi-octet
   encoding unit, as a way to recognize the serialization order of the
   octets.

The sentence that ends here is too long. Please split.

   UTF-8 having a single-octet encoding unit, this last function is
   useless and the BOM will always appear as the octet sequence EF BB
   BF.

<40>

   The character sequence representing the Hangul characters for the
   Korean word "hangugo" (U+D55C, U+AD6D, U+C5B4) is encoded in UTF-8
   as follows:

Please say that this word means Korean (language) in Korean. And it
should probably be spelled hangugeo.

<41>

   The character sequence representing the Han characters for the
   Japanese word "nihongo" (U+65E5, U+672C, U+8A9E) is encoded in
   UTF-8 as follows:

Please say that nihongo means Japanese (language).

<42>

   The character U+233B4 (a Chinese character meaning 'stump of
   tree'), prepended with a UTF-8 BOM, is encoded in UTF-8 as follows:

Please don't give an example of a bad practice.

<43>

   This memo is meant to serve as the basis for registration of a MIME
   character set parameter (charset) [RFC2978].
Obviously, UTF-8 is already registered. So I would reword this a bit,
maybe starting "This memo serves as the basis for the registration
of...". Then probably add an IANA Considerations section where you say
"Please update the reference for UTF-8 to point to this memo." or so.

8. Security Considerations

- Most of the attacks described have actually taken place. I think
  some 'might's and 'could's should be changed so that it's clearer
  that these are very realistic threats.

- It might be a good idea, here or somewhere else in the document, to
  provide some regular expressions that fully check UTF-8 byte
  sequences. Here is one from the W3C validator, in Perl (because Perl
  allows spaces, this is rather readable :-):

  s/ [\x00-\x7F]                        # ASCII
   | [\xC2-\xDF]        [\x80-\xBF]     # non-overlong 2-byte sequences
   | \xE0[\xA0-\xBF]    [\x80-\xBF]     # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte sequences
   | \xED[\x80-\x9F]    [\x80-\xBF]     # excluding surrogates
   | \xF0[\x90-\xBF]    [\x80-\xBF]{2}  # planes 1-3
   | [\xF1-\xF3]        [\x80-\xBF]{3}  # planes 4-15
   | \xF4[\x80-\x8F]    [\x80-\xBF]{2}  # plane 16
   //xg;

  (This substitutes all legal UTF-8 sequences away; if there is
  something left, it's not UTF-8.) This is for planes 0-16 only.

  Another is the ABNF from the usenet draft
  (http://www.ietf.org/internet-drafts/draft-ietf-usefor-article-06.txt):

  UTF8-xtra-2-head= %xC2-DF
  UTF8-xtra-3-head= %xE0 %xA0-BF / %xE1-EC %x80-BF /
                    %xED %x80-9F / %xEE-EF %x80-BF
  UTF8-xtra-4-head= %xF0 %x90-BF / %xF1-F7 %x80-BF
  UTF8-xtra-5-head= %xF8 %x88-BF / %xF9-FB %x80-BF
  UTF8-xtra-6-head= %xFC %x84-BF / %xFD %x80-BF
  UTF8-xtra-tail  = %x80-BF
  UTF8-xtra-char  = UTF8-xtra-2-head 1( UTF8-xtra-tail ) /
                    UTF8-xtra-3-head 1( UTF8-xtra-tail ) /
                    UTF8-xtra-4-head 2( UTF8-xtra-tail ) /
                    UTF8-xtra-5-head 3( UTF8-xtra-tail ) /
                    UTF8-xtra-6-head 4( UTF8-xtra-tail )

  This doesn't yet include US-ASCII, and it covers character numbers
  of up to 31 bits. Either of them probably needs a bit of work.
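As a cross-check on the Perl expression from the W3C validator, the
same pattern transliterates fairly directly into a Python bytes regexp.
This is my own sketch (names are mine) and, like the original, it
accepts planes 0-16 only, rejecting overlongs and surrogates:

```python
import re

# Bytes-level pattern for well-formed UTF-8 restricted to U+0000..U+10FFFF,
# transliterated from the W3C validator's Perl regular expression.
UTF8_SEQUENCE = re.compile(rb"""
    [\x00-\x7F]                         # US-ASCII
  | [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte sequences
  | \xE0[\xA0-\xBF][\x80-\xBF]          # 3-byte, excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte sequences
  | \xED[\x80-\x9F][\x80-\xBF]          # 3-byte, excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}       # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}           # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}       # plane 16
""", re.VERBOSE)

def is_utf8(data: bytes) -> bool:
    # Substitute every legal sequence away; valid input leaves nothing behind.
    return UTF8_SEQUENCE.sub(b"", data) == b""
```

So is_utf8(b"\xed\x95\x9c") is True, while the overlong b"\xc0\xaf" and
the surrogate b"\xed\xa0\x80" are both rejected.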
<59>

The encoding of your name and address, and Alain's and my name, is
messed up. Please don't try to smuggle something around the I-D
editor; it's not guaranteed to work.

Regards,    Martin.
Received on Wednesday, 17 April 2002 05:41:52 UTC