RE: Comments on draft-yergeau-rfc2279bis-00.txt

Martin Duerst wrote:
> While the title of ISO/IEC 10646 includes 'multi-octet', I think
> this is confusing, because we want to clearly separate characters,
> their numbers in the UCS, and the actual encoding into octets,...
> I suggest you remove 'multi-octet' everywhere except for the
> formal title in the reference, and if necessary replace it
> with something like 'large'.

OK.  The abstract now starts:

   ISO/IEC 10646-1 defines a large character set called the
   Universal Character Set (UCS) which encompasses most of the world's
   writing systems. The originally proposed encodings of the UCS,
   however, were not compatible with ...

Same for the first para of the introduction.  Other instances of
"multi-octet" were of a different kind and I left them alone.  Please check.


> <13>
>     o  The lexicographic sorting order of strings is preserved.  Of
>        course this is of limited interest since a sort order based on
>        character numbers is not culturally valid.
> 
> 'preserved' in respect to what?

Good catch. Now "The lexicographic sorting order of UTF-8 strings is the
same as if ordered by character numbers."
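
To illustrate (a Python sketch, just for this discussion, not for the
draft): sorting UTF-8 strings as plain octet strings gives the same
order as sorting by character number.

    samples = ["abc", "ab\u00e9", "\u00e9", "\u4e2d\u6587", "\U00010000"]

    by_numbers = sorted(samples)  # Python compares by character number
    by_octets = sorted(samples, key=lambda s: s.encode("utf-8"))

    assert by_numbers == by_octets  # identical orderings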


> <14>
>     o  The Boyer-Moore fast search algorithm can be used with UTF-8
>        data.
> 
> This should be worded more general, at least inserting something
> like 'and similar algorithms'.

Well, I do know about Boyer-Moore, but not about other algorithms.  I
wouldn't want to generalize to something wrong.
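
What I can vouch for is the underlying property: a byte-oriented search
never produces a false hit on well-formed UTF-8, because the encoding
of one character cannot occur inside the encoding of another.  A quick
Python sketch (bytes.find stands in for Boyer-Moore or any other
byte-level algorithm):

    haystack = "caf\u00e9 au lait".encode("utf-8")  # 63 61 66 C3 A9 ...
    needle = "\u00e9".encode("utf-8")               # C3 A9

    # Any hit necessarily falls on a character boundary.
    assert haystack.find(needle) == 3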


> <15>
>     o  UTF-8 strings can be fairly reliably recognized as such by a
>        simple algorithm, i.e.  the probability that a string of
>        characters in any other encoding appears as valid UTF-8 is low,
>        diminishing with increasing string length.
> 
> This should maybe somehow mention the special case of an US-ASCII-only
> string (which can be easily detected, but...).

It's already mentioned in <7>.
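
For concreteness, here is a Python sketch of such a simple algorithm
(it checks the octet structure only; a full checker would also reject
the few remaining overlong and surrogate forms):

    def looks_like_utf8(data: bytes) -> bool:
        i = 0
        while i < len(data):
            b = data[i]
            if b < 0x80:                # single octet (US-ASCII)
                n = 0
            elif 0xC2 <= b <= 0xDF:     # two-octet sequence
                n = 1
            elif 0xE0 <= b <= 0xEF:     # three-octet sequence
                n = 2
            elif 0xF0 <= b <= 0xF4:     # four-octet sequence
                n = 3
            else:
                return False
            if i + n >= len(data):      # truncated sequence
                return False
            if any(not 0x80 <= data[i + j] <= 0xBF for j in range(1, n + 1)):
                return False
            i += n + 1
        return True

    assert looks_like_utf8("h\u00e9llo".encode("utf-8"))
    assert looks_like_utf8(b"plain ASCII")   # your US-ASCII case: trivially valid
    assert not looks_like_utf8("h\u00e9llo".encode("latin-1"))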


> <16>
> [...]
>     Entenmann.  Later, Ken Thompson and Rob Pike did significant
>     work for the formal UTF-8.
> 
> formal UTF-8 -> formal definition of UTF-8 ?

OK, done.


> <25>
>     3.  Fill in the bits marked x from the bits of the character number,
>         expressed in binary.  Start from the lower-order bits of the
>         character number and put them first in the last octet of the
>         sequence, then the next to last, etc.  until all x bits are
>         filled in.
> 
> This misses one important detail: the sequence in which the bits
> are filled into a byte. This should be fixed. Maybe we can
> make things even clearer, as follows:

This text dates back to RFC 2044 (October 1996) and since then nobody
has complained; in fact, I have had a few reports from people saying
this was the clearest exposition of UTF-8 they had seen.  I'm therefore
very reluctant to change it!

It seems that people know where "low-order bits" go in a byte.  Your
proposed table may be more explicit, but not necessarily clearer.
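
Not for the draft, but for the record: the steps as written translate
directly into code.  A Python sketch (I've capped the range at U+10FFFF
here; adjust if the draft keeps the full UCS-4 range):

    def encode_utf8(cp: int) -> bytes:
        if cp < 0x80:
            return bytes([cp])          # single octet, high bit 0
        if cp < 0x800:
            lead, n = 0xC0, 1
        elif cp < 0x10000:
            lead, n = 0xE0, 2
        elif cp < 0x110000:
            lead, n = 0xF0, 3
        else:
            raise ValueError("character number out of range")
        octets = bytearray(n + 1)
        for i in range(n, 0, -1):       # step 3: fill the last octet first
            octets[i] = 0x80 | (cp & 0x3F)
            cp >>= 6
        octets[0] = lead | cp           # remaining high-order bits
        return bytes(octets)

    assert encode_utf8(0x233B4) == b"\xf0\xa3\x8e\xb4"
    assert encode_utf8(0x45) == b"\x45"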


> <32>
>     ISO/IEC 10646 is updated from time to time by publication of
>     amendments and additional parts; similarly, different versions
>     of the Unicode standard are published over time.  Each new
>     version obsoletes and replaces the previous one, but
>     implementations, and more significantly data, are not updated
>     instantly.
> 
> 'different versions' gives the impression that these might be
> diverging versions.

s/different versions/new versions/


> As far as I understand, amendments for ISO standards are numbered
> separately for each version. So we need to clearly say here that
> it is Amendments 5 to 10646:1993. Also, saying when that change
> happened (Ken?) will help bringing things in perspective for the
> new reader.

Good catch and good idea.  New text:

"In 1996, Amendment 5 to the 1993 edition of ISO/IEC 10646 and Unicode 2.0
moved and expanded the Korean Hangul block, thereby ..."

The sentence "Unicode 2.0 has the same difference from Unicode 1.1."
goes away, since it is now covered above.



>     Unicode 1.1.  The official justification for allowing such an
>     incompatible change was that no implementations and no data
>     containing Hangul existed, a statement that is likely to be true
>     but remains unprovable.
> 
> As I personally had an implementation as well as some data
> (in ET++, so this was also part of Lys), this is provably false.
> I propose to change this to "The justification for allowing such an
> incompatible change was that there were no major implementations
> and no significant amounts of data containing Hangul."

OK, done.


> <34>
>     New versions, and in particular any incompatible changes, have
>     consequences regarding MIME character encoding labels, to be
>     discussed in section 5.
> 
> 'character encoding' -> '"charset"' (I fight against the term
> 'character set' or 'charset' quite a bit, but here, it's the
> right word to use, because that's the name of the parameter.)

Yes, done.


> 'New versions have consequences' sounds a bit strange. What about:
> The consequences of versioning on MIME "charset" labels, in
> particular in the case of incompatible changes, are discussed
> in Section 5.

Increases obscurity IMHO.  If you insist, I can say "The appearance of new
versions..."


> 5. Byte order mark (BOM)
> 
> This section needs more work. The 'change log' says that it's
> mostly taken from the UTF-16 RFC. But the BOM for UTF-8 is
> much less necessary, and much more of a problem, than for UTF-16.
> We should clearly say that with IETF protocols, character encodings
> are always either labeled or fixed, and therefore the BOM SHOULD
> (and MUST at least for small segments) never be used for UTF-8.
> And we should clearly give the main argument, namely that it
> breaks US-ASCII compatibility (US-ASCII encoded as UTF-8
> (without a BOM) stays exactly the same, but US-ASCII encoded
> as UTF-8 with a BOM is different).

I don't quite see your point.  A US-ASCII string, with or without a
BOM, is always a valid UTF-8 string; I don't see where compatibility is
broken.  I can see that protocols shouldn't *require* a BOM, because
then a strict (BOM-less) ASCII string wouldn't meet the requirement.
But that's not what you're saying, right?
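
To make sure we're talking about the same thing, here is my reading as
a Python sketch: both octet streams below are valid UTF-8 (my point),
but they are not byte-identical (yours, I think):

    plain = "Hello".encode("utf-8")
    with_bom = ("\ufeff" + "Hello").encode("utf-8")

    assert plain == b"Hello"                       # same octets as US-ASCII
    assert with_bom == b"\xef\xbb\xbf" + b"Hello"  # three extra octets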


> <35>
>     Unicode characters as a "signature"; a receiver of such a serialized
> 
> Unicode characters -> UCS characters ?

OK (two places).


>     stream may then use the initial character both as a hint that the
>     stream consists of Unicode characters, as a way to recognize which
>     UCS encoding is involved and, with encodings having a multi-octet
>     encoding unit, as a way to recognize the serialization 
>     order of the octets.
> 
> The sentence that ends here is too long. Please split.

OK, done.
 

> <40>
>     The character sequence representing the Hangul characters for the
>     Korean word "hangugo" (U+D55C, U+AD6D, U+C5B4) is encoded in
>     UTF-8 as follows:
> 
> Please say that this word means Korean (language) in Korean.
> And it should probably be spelled hangugeo.

I've reworded the first 3 examples along the lines of:

     The character sequence U+D55C U+AD6D U+C5B4 (Korean "hangugeo",
     meaning "the Korean language") is encoded in UTF-8 as follows:

> <42>
>     The character U+233B4 (a Chinese character meaning 'stump of
>     tree'), prepended with a UTF-8 BOM, is encoded in UTF-8 as
>     follows:
> 
> Please don't give an example of a bad practice.

I'll agree if we end up banning it, but otherwise I'd rather show it.
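
For reference, the octets that example shows (Python check; the BOM
itself is EF BB BF):

    assert ("\ufeff\U000233b4".encode("utf-8")
            == b"\xef\xbb\xbf\xf0\xa3\x8e\xb4")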


> <43>
>     This memo is meant to serve as the basis for registration of a
>     MIME character set parameter (charset) [RFC2978].
> 
> Obviously, UTF-8 is already registered. So I would reword this a bit,
> maybe starting "This memo serves as the basis for the registration
> of...".

OK: "This memo serves as the basis for registration of the MIME
charset parameter for UTF-8, according to [RFC2978]."


> Then probably add an IANA consideration section where you say:
> "Please update the reference for UTF-8 to point to this memo." or so.

Does that really belong *in* the doc itself?


> 8. Security Considerations
> 
> - Most of the attacks described have actually taken place.
>    I think some 'might's and 'could's should be changed so that
>    it's clearer that these are very realistic threats.

Suggestions?
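
One concrete case I can cite is the overlong encoding of "/": a decoder
without overlong checks turns C0 AF into "/", so a filter that scans
the raw octets for "../" can be bypassed.  A Python sketch (the naive
decoder here is deliberately broken):

    def naive_decode_2(b: bytes) -> str:
        # BROKEN: no check that the sequence is the shortest form
        return chr(((b[0] & 0x1F) << 6) | (b[1] & 0x3F))

    assert naive_decode_2(b"\xc0\xaf") == "/"  # slips past a "../" filter

    try:
        b"\xc0\xaf".decode("utf-8")            # a strict decoder rejects it
    except UnicodeDecodeError:
        pass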


> - It might be a good idea, here or somewhere else in the document,
>    to provide some regular expressions that fully check UTF-8 byte
>    sequences.

Regexps would be nice, but we'd need to refer to a definition of the regexp
language itself.  Any suitable source?


Thanks for all the good comments!

-- 
François

Received on Wednesday, 17 April 2002 16:52:36 UTC