Re: UTF-8 revision

Martin J. Dürst (mduerst@ifi.unizh.ch)
Mon, 13 Oct 1997 11:30:39 +0100 (MET)


Date: Mon, 13 Oct 1997 11:30:39 +0100 (MET)
From: Martin J. Dürst <mduerst@ifi.unizh.ch>
Subject: Re: UTF-8 revision
In-reply-to: <998.873624127@dale.uninett.no>
To: Harald.T.Alvestrand@uninett.no
Cc: Francois Yergeau <yergeau@alis.com>, ietf-charsets@INNOSOFT.COM
Message-id: <Pine.SUN.3.96.971008201625.7026i-100000@enoshima>

On Sun, 7 Sep 1997 Harald.T.Alvestrand@uninett.no wrote:

> Francois,
> your revision seems good to me (as usual).

Seems very good to me, too!

> Some nits:

[More nits from my side below (as usual :-).]


> - The text now appears to treat Unicode and ISO 10646 as bodies of
>   equal standing. I would like to refer as much as possible ONLY to
>   ISO 10646, and remove "unnecessary" references to Unicode, keeping
>   only enough information to ensure that a reader sees how Unicode is
>   equivalent to ISO 10646 as of now.
>   The main reason is because of the problems John Klensin mentioned
>   about ISO being more of an "acceptable standards body" in the IETF
>   than the Unicode Consortium is; the other reason is that I *hate*
>   depending on two variable external references when one is enough.

I somehow agree with Harald. But if ISO 10646 costs $300 or more
(there are countries where you can get it for free, or have a
look at it for free in a library, but these countries are rare), and
Unicode costs $50, I would dare to say that the Unicode book is more
acceptable as a publication to the IETF than the ISO 10646 book :-).
In addition, Unicode has a lot of the actual details of their
standard on their web site, but ISO doesn't. Again in this respect,
Unicode seems more acceptable for the IETF than ISO.

So the solution seems to be: Officially depend on ISO, but tell
people where they can find stuff easily and cheaper.


> - Note: I think it makes sense to call this document for Proposed
>   Standard; there is no particular value in having its status be
>   Informational. (The two other documents in the package, the charset
>   policy and the registration document, are both headed for BCP, I
>   think; objectors speak up!)

I think most documents defining "charset"s are informational,
even if they are extremely standard in their actual use.
But because UTF-8 is so central for the IETF, it's a good
idea to make it proposed.


> Thought for list: One alternative to registering UNICODE-1-1-UTF-8 is
> to standardize the "charset-edition" of RFC 1922 section 4.1.
> Comments on this alternative?

Like others, I disagree. I opposed this when RFC 1922 was discussed,
unfortunately without much success. But maybe, like some other features,
it will just be forgotten :-).



Some details:

> Abstract
> 
>    ISO/IEC 10646-1 defines a multi-octet character set called the Uni-
>    versal Character Set (UCS) which encompasses most of the world's
>    writing systems. Multi-octet characters, however, are not compatible
>    with many current applications and protocols, and this has led to the
>    development of a few so-called UCS transformation formats (UTF), each
>    with different characteristics.  UTF-8, the object of this memo, has
>    the characteristic of preserving the full US-ASCII range, providing
>    compatibility with file systems, parsers and other software that rely
>    on US-ASCII values but are transparent to other values. This memo
>    updates and replaces RFC 2044, in particular addressing the question
>    of versions of the relevant standards.

I am starting to have some problems with the term "multi-octet" here,
and these problems continue below. There is terminology distinguishing
'multibyte' and 'wide character', and UCS-2 and UCS-4 seem to
belong to the latter category rather than to the former.

But this is a terminological nitpick which maybe isn't worth
pursuing.

> 1.  Introduction
> 
>    ISO/IEC 10646-1 [ISO-10646] defines a multi-octet character set
>    called the Universal Character Set (UCS), which encompasses most of
>    the world's writing systems.  Two multi-octet encodings are defined,
>    a four-octet per character encoding called UCS-4 and a two-octet per
>    character encoding called UCS-2, able to address only the first 64K
>    characters of the UCS (the Basic Multilingual Plane, BMP), outside of
>    which there are currently no assignments.

"Outside of which there are currently no assignements" will read
very historical very soon. There are plans for assignements in
plane 1 and plane 2, and there is the plane 14 proposal.



>    It is noteworthy that the same set of characters is defined by the
>    Unicode standard [UNICODE], which further defines additional charac-
>    ter properties and other application details of great interest to
>    implementors, but does not have the UCS-4 encoding.  Up to the pre-
>    sent time, changes in Unicode and amendments to ISO/IEC 10646 have
>    tracked each other, so that the character repertoires and code point
>    assignments have remained in sync.  The relevant standardization com-
>    mittees have committed to maintain this very useful synchronism.

To make things clear, I would write "the same set of characters,
with identical codepoints" or something like that, at the start
of the paragraph. Set of characters sounds like it's only the same
repertoire (set in the mathematical sense).
Also, "does not have the UCS-4 encoding" may be read as if UCS-2
is the only encoding in Unicode, so that it is limited to 64K.
Actually, Unicode uses UTF-16, so it is not restricted to 64K.


>    The UCS-2 and UCS-4 encodings, however, are hard to use in many cur-
>    rent applications and protocols that assume 8 or even 7 bit charac-
>    ters.  Even newer systems able to deal with 16 bit characters cannot
>    process UCS-4 data. This situation has led to the development of so-
>    called UCS transformation formats (UTF), each with different charac-
>    teristics.
> 
>    UTF-1 has only historical interest, having been removed from ISO/IEC
>    10646.  UTF-7 has the quality of encoding the full BMP repertoire
>    using only octets with the high-order bit clear (7 bit US-ASCII 
>    values, [US-ASCII]), and is thus deemed a mail-safe encoding
>    ([RFC2152]).  UTF-8, the object of this memo, uses all bits of an
>    octet, but has the quality of preserving the full US-ASCII range: 
>    US-ASCII characters are encoded in one octet having the normal 
>    US-ASCII value, and any octet with such a value can only stand for 
>    an US-ASCII character, and nothing else.

UTF-7 also encodes planes 1-16, using surrogates.

"mail-safe encoding" -> replace this by something like "the 7bit
MIME (pseudo) content transfer encoding". With MIME or with ESMTP,
UTF-8 is also safe. Mail is getting safer every day.


>    UTF-16 is a scheme for transforming a subset of the UCS-4 repertoire
>    into pairs of UCS-2 values from a reserved range.  UTF-16 impacts
>    UTF-8 in that UCS-2 values from the reserved range must be treated
>    specially in the UTF-8 transformation.

There is some confusion with terms. UCS-2 only encompasses the
characters directly assigned in the BMP. As soon as you use the
surrogate area, it is no longer UCS-2, it is UTF-16.
It is true (and very important!) that when transcoding UTF-16 to
UTF-8, the surrogate area (this is a Unicode term; the ISO
term is different) needs special treatment.


>    UTF-8 encodes UCS-2 or UCS-4 characters as a varying number of
>    octets, where the number of octets, and the value of each, depend on
>    the integer value assigned to the character in ISO/IEC 10646.  This
>    transformation format has the following characteristics (all values
>    are in hexadecimal):

Why not say "UTF-8 encodes UCS characters"? That you mention the
integer value of a UCS character here is very helpful, but
introducing it much earlier would probably make things even
easier to understand.


>    -  Round-trip conversion is easy between UTF-8 and either of UCS-4,
>       UCS-2.

The way you write it, it seems to say that the UTF-8 -> UCS-2 -> UTF-8
roundtrip is easy. This is of course not always the case. What
is easy is UCS-2 -> UTF-8 -> UCS-2.


>    -  The octet values FE and FF never appear.

Neither do the octets (in binary) 11000000 and 11000001 (C0 and C1),
because they would encode US-ASCII characters with two octets. I would propose
that you either move this to become the second-last point, or
that you mention why FE and FF are important (I seem to remember
there was something special about them in telnet).
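For illustration, the arithmetic behind this (a small Python sketch of my own, not part of the draft):

```python
# Two-octet UTF-8 sequences 110xxxxx 10xxxxxx decode to
# ((b1 & 0x1F) << 6) | (b2 & 0x3F). With lead octet C0 or C1 the
# result is always below 0x80, i.e. a US-ASCII value that must be
# encoded in a single octet, so C0 and C1 can never appear.
for lead in (0xC0, 0xC1):
    largest = ((lead & 0x1F) << 6) | 0x3F  # largest value this lead octet can encode
    assert largest < 0x80
    print("lead octet %02X encodes at most %02X" % (lead, largest))
```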


>    -  The Boyer-Moore fast search algorithm can be used with UTF-8 data.

There are many other algorithms for fast search, such as Rabin-Karp,
Knuth-Morris-Pratt,...
I would propose saying something like:
	Fast search algorithms such as the Boyer-Moore algorithm,
	and their implementations for octet sequences, can be used
	without changes, in particular without drastically increasing
	storage requirements and without producing false positives.
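The property that makes all of these algorithms work is that the octet
sequence of one character can never occur inside that of another. A quick
Python sketch (bytes.find is of course not Boyer-Moore, but the point is
the same):

```python
# Byte-level substring search works on UTF-8 without decoding and
# without false positives, because no character's octet sequence
# can appear inside another character's octet sequence.
text = "Grüße, UTF-8!".encode("utf-8")
assert text.find("ü".encode("utf-8")) == 2    # C3 BC, at octet offset 2
assert text.find("é".encode("utf-8")) == -1   # absent character never matches
```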


>    -  UTF-8 strings can be fairly reliably recognized as such by a sim-
>       ple algorithm, i.e. the probability that a string of characters in
>       any other encoding appears as valid UTF-8 is low, diminishing with
>       increasing string length.

The handouts of my UTF-8 paper of last Unicode conference are now
available on the web from
	http://www.ifi.unizh.ch/mml/mduerst/papers.html#IUC11-UTF-8
(Postscript; sorry, no PDF yet). This may be suited as a reference
here. Also, there is code for this check in draft-ietf-ftpext-intl-ftp-02.txt.
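For illustration, such a check could look like the following (a Python
sketch of the structural test only; it does not reject overlong forms,
and it is not the code from the FTP draft):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is structurally valid UTF-8 (the 1- to
    6-octet forms, with continuation octets of the form 10xxxxxx)."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n = 0                  # single octet, US-ASCII
        elif 0xC0 <= b < 0xE0:
            n = 1                  # 110xxxxx: one continuation octet
        elif 0xE0 <= b < 0xF0:
            n = 2
        elif 0xF0 <= b < 0xF8:
            n = 3
        elif 0xF8 <= b < 0xFC:
            n = 4
        elif 0xFC <= b < 0xFE:
            n = 5
        else:
            return False           # FE, FF, or a stray continuation octet
        if i + n >= len(data):
            return False           # truncated sequence
        for j in range(1, n + 1):
            if data[i + j] & 0xC0 != 0x80:
                return False       # continuation must be 10xxxxxx
        i += n + 1
    return True
```

Text in almost any other encoding fails this test quickly, and the longer
the string, the smaller the chance of a false positive.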


>    UTF-8 was originally a project of the X/Open Joint Internationaliza-
>    tion Group XOJIG with the objective to specify a File System Safe UCS
>    Transformation Format [FSS-UTF] that is compatible with UNIX systems,
>    supporting multilingual text in a single encoding.  The original
>    authors were Gary Miller, Greger Leijonhufvud and John Entenmann.
>    Later, Ken Thompson and Rob Pike did significant work for the formal
>    UTF-8.

What do you mean by "formal"?


>    A description can also be found in Unicode Technical Report #4 and in
>    the Unicode Standard, version 2.0 [UNICODE].  The definitive refer-
>    ence, including provisions for UTF-16 data within UTF-8, is Annex R
>    of ISO/IEC 10646-1 [ISO-10646].

It should be mentioned that [UNICODE] contains algorithms for both
directions that also take UTF-16 into account.


>    Encoding from UCS-4 to UTF-8 proceeds as follows:

>    3) Fill in the bits marked x from the bits of the character value,
>       starting from the lower-order bits of the character value and
>       putting them first in the last octet of the sequence, then the
>       next to last, etc. until all x bits are filled in.

That's correct, but rarely what an implementation does. Why not say:

Fill in the bits marked x from the bits of the character value,
by ignoring any leading zero bits, and by starting from the higher-
order bits of the character value and from the higher-order
positions in the first octet of the sequence, and so on.
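In Python, this higher-order-first filling looks like the following
(an illustrative sketch of the draft's steps, not normative code):

```python
def ucs4_to_utf8(code: int) -> bytes:
    """Encode a UCS-4 value using the octet patterns of the draft
    (covers the 1- to 6-octet forms, up to 7FFFFFFF)."""
    assert 0 <= code <= 0x7FFFFFFF
    if code < 0x80:
        return bytes([code])       # US-ASCII, single octet
    # Find the number of continuation octets n and the lead marker:
    # an n-continuation form carries 5*n + 6 payload bits.
    for n, marker in ((1, 0xC0), (2, 0xE0), (3, 0xF0), (4, 0xF8), (5, 0xFC)):
        if code < (1 << (5 * n + 6)):
            break
    # Fill from the higher-order bits: lead octet first, then each
    # continuation octet takes the next six bits, as 10xxxxxx.
    octets = [marker | (code >> (6 * n))]
    for shift in range(6 * (n - 1), -1, -6):
        octets.append(0x80 | ((code >> shift) & 0x3F))
    return bytes(octets)
```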



>       The algorithm for encoding UCS-2 (or Unicode) to UTF-8 can be
>       obtained from the above, in principle, by simply extending each
>       UCS-2 character with two zero-valued octets.  However, pairs of
>       UCS-2 values between D800 and DFFF (surrogate pairs in Unicode
>       parlance), being actually UCS-4 characters transformed through
>       UTF-16, need special treatment: the UTF-16 transformation MUST be
>       undone, yielding a UCS-4 character that is then transformed as
>       above.

This has to be changed to conform to the fact that UCS-2 doesn't
include surrogate values.
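For reference, undoing the UTF-16 transformation is a single line of
arithmetic (an illustrative Python sketch):

```python
def surrogate_pair_to_ucs4(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair (high in D800-DBFF, low in
    DC00-DFFF) back into the UCS-4 value it transforms."""
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

The resulting value is then encoded with the four-octet UTF-8 form as
described above.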


>       Decoding from UTF-8 to UCS-4 proceeds as follows:

>    3) Distribute the bits from the sequence to the UCS-4 character,
>       first the lower-order bits from the last octet of the sequence and
>       proceeding to the left until no x bits are left.

Again, I think starting from higher-order is what most algorithms
do, and is easier to describe.


>       If the UTF-8 sequence is no more than three octets long, decoding
>       can proceed directly to UCS-2.

The association between 3 octets and UCS-2 is true, but is in
conflict with your use of "UCS-2" to mean UTF-16. Also, what
does it mean that decoding can proceed directly to UCS-2?
If you are in the process of decoding from UTF-8 to UCS-4,
you cannot suddenly change to outputting UCS-2. What you can
say is: a UTF-8 text that doesn't contain any sequences
with length greater than three can be decoded into UCS-2.
If longer sequences appear, some of them may be decodable
into UTF-16, while others just cannot be decoded without
using UCS-4.
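A higher-order-first decoder is correspondingly simple (an illustrative
Python sketch; it assumes the sequence has already been validated):

```python
def utf8_to_ucs4(octets: bytes) -> int:
    """Decode one complete UTF-8 sequence into a UCS-4 value,
    distributing bits from the higher-order end first."""
    lead = octets[0]
    if lead < 0x80:
        return lead                    # single octet, US-ASCII
    n = len(octets) - 1                # number of continuation octets
    # Mask the marker bits out of the lead octet (an n-continuation
    # lead octet has 6 - n payload bits), then shift in six bits from
    # each continuation octet, high-order first.
    value = lead & (0x3F >> n)
    for b in octets[1:]:
        value = (value << 6) | (b & 0x3F)
    return value
```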


>    A more detailed algorithm and formulae can be found in [FSS_UTF],
>    [UNICODE] or Annex R to [ISO-10646].

I guess the reference to 10646 should come first here.


> 3.  Versions of the standards
> 
>    ISO/IEC 10646 is updated from time to time by published amendments;
>    similarly, different versions of the Unicode standard exist: 1.0, 1.1
>    and 2.0 as of this writing.  Each new version obsoletes and replaces
>    the previous one, but implementations, and more significantly data,
>    are not updated instantly.
> 
>    In general, the changes amount to adding new characters, which does
>    not pose particular problems with old data.  Amendment 5 to ISO/IEC
>    10646, however, has moved and expanded the Korean Hangul block,
>    thereby making any previous data containing Hangul characters invalid
>    under the new version.  Unicode 2.0 has the same difference from Uni-
>    code 1.1. The official justification for allowing such an incompati-
>    ble change was that no implementations and no data containing Hangul
>    existed, a statement that is likely to be true but remains unprov-
>    able.

It is not true. I have such an implementation, and a little bit of
data (mainly a keyboard table). I also think that this was never
really claimed, it was only claimed that such implementations/data
were insignificant.


> The incident has been dubbed the "Korean mess", and the rele-
>    vant committees have pledged to never, ever again make such an incom-
>    patible change.

Good way to put it!


> 5.  MIME registration

>    It is noteworthy that the label "UTF-8" does not contain a version
>    identification, referring generically to ISO/IEC 10646.  This is
>    intentional, the rationale being as follows:

I think the whole rationale and presentation here are extremely good.
Please let's make sure that the relevant Unicode/ISO officials get to
read this document. I think most of them got the message, but it can't
do any harm if they get it twice.


> 6.  Security Considerations
> 
>    Implementors of UTF-8 need to consider the security aspects of how
>    they handle illegal UTF-8 sequences.  It is conceivable that in some
>    circumstances an attacker would be able to exploit an incautious
>    UTF-8 parser by sending it an octet sequence that is not permitted by
>    the UTF-8 syntax.
> 
>    A particularly subtle form of this attack could be carried out
>    against a parser which performs security-critical validity checks
>    against the UTF-8 encoded form of its input, but interprets certain
>    illegal octet sequences as characters.  For example, a parser might
>    prohibit the NUL character when encoded as the single-octet sequence
>    00, but allow the illegal two-octet sequence C0 80 and interpret it
>    as a NUL character.  Another example might be a parser which
>    prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
>    illegal octet sequence 2F C0 AE 2E 2F.

Very good! This will hopefully make sure decoders check carefully,
and so will prevent people from coming up with ideas of how to
"extend" UTF-8 in messy ways.



> Acknowledgments
> 
>    The following have participated in the drafting and discussion of
>    this memo:
> 
>    James E. Agenbroad    Andries Brouwer
>    Martin J. Dürst       Ned Freed

Please replace the ü by 'ue'. Past experience with the RFC editor
has shown that it's not possible to smuggle such things past him/her,
and I prefer to end up as Duerst rather than as Drst or D|rst or
whatever. And I guess next time you try, you should try with UTF-8
instead of iso-8859-1 :-).


This concludes the nits on the draft itself. There is one other
point I would like to mention:

There was some discussion about whether the UTF-8 encoding of the BOM
would be acceptable as a "magic number". There is of course no need to
have the BOM (byte order mark) for byte ordering purposes, there is only
one byte ordering in UTF-8. But it still can make sense purely as a
magic number. It would be nice if there were a clear statement about whether
and when this is possible/desirable, but I don't know whether consensus
on this issue has been reached or can be reached. Because in the
context of the IETF, UTF-8 is either identified by label (e.g. mail
headers and bodies), or is the only encoding (e.g. ACAP), or appears
in short identifiers where the use of a BOM would only complicate
things (e.g. FTP file names), I can see two solutions:

A) For IETF protocols, the BOM is not allowed as a magic number in UTF-8.
	[It is of course at the discretion of local implementations to
	use it as a magic number, but it then has to be removed before
	sending.]

B) For IETF protocols, the BOM is strongly discouraged for running text
	and disallowed for identifiers. Implementations should accept and
	remove the BOM in the sense of "be liberal in what you accept".
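Under option B, the "accept and remove" step is trivial (an illustrative
Python sketch; EF BB BF is the UTF-8 encoding of U+FEFF):

```python
UTF8_BOM = b"\xef\xbb\xbf"   # UTF-8 encoding of U+FEFF, the byte order mark

def strip_bom(data: bytes) -> bytes:
    """Be liberal in what you accept: drop a leading BOM if present."""
    return data[len(UTF8_BOM):] if data.startswith(UTF8_BOM) else data
```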


Regards,	Martin.





