RE: RFC 2279 (UTF-8) to Full Standard

Martin,

> At 18:10 02/04/11 -0700, Kenneth Whistler wrote:
> 
> >I agree, even though the Unicode Standard only describes UTF-8
> >out to U+10FFFF. 10646 still gives the full scheme to U-7FFFFFFF,
> >and it will be awhile (if ever) before we can change that to
> >deprecate all the 5- and 6-byte values.
> 
> I thought ISO had adopted a standing policy on not allocating
> anything beyond U+10FFFF. Ken, do you know the exact status of
> this? Can you tell us?

Mark pointed out that UAX #19 (UTF-32) quotes WG2 Resolution
M38.6. That resolution was a key step for WG2 in syncing up
10646 with the Unicode position that no code points beyond
U+10FFFF are to be used, to ensure interoperability between
the UTFs.

However, we can go beyond that now. Amendment 1 to 10646-1:2000
is now officially published. The relevant clause is clause 9.1:

<quote>
9.1 Planes reserved for future standardization

Planes 11 to FF in Group 00 and all planes in any other groups ...
are reserved for future standardization, and thus those code
positions shall not be used for any other purpose.

Code positions in these planes do not have a mapping to the
UTF-16 form (see Annex C).

  NOTE - To ensure continued interoperability between
  the UTF-16 form and other coded representations of the
  UCS, it is intended that no characters will be allocated
  to code positions in Planes 11 to FF in Group 00 or any
  planes in any other groups.
</quote>

Thus the normative text of the standard says everything beyond
U+10FFFF is reserved, and thou shalt not use it.
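
To make the plane arithmetic concrete, here is a rough sketch (mine,
not text from the standard) of how a code position splits into group
and plane octets, and of the reservation test the quoted clause
implies. The function names are made up for illustration, and the
check follows only the clause text as quoted above; whatever sits
behind the "..." in the quote is not modelled.

```python
def group_and_plane(code_position: int) -> tuple:
    """Split a 31-bit UCS code position into its (group, plane) octets."""
    if not 0 <= code_position <= 0x7FFFFFFF:
        raise ValueError("outside the 10646 codespace (0..7FFFFFFF)")
    group = (code_position >> 24) & 0x7F   # top octet, 7 usable bits
    plane = (code_position >> 16) & 0xFF   # second octet
    return group, plane


def reserved_by_clause_9_1(code_position: int) -> bool:
    """True if the position falls in Planes 11..FF of Group 00 or in
    any other group, per the clause as quoted (elided part ignored)."""
    group, plane = group_and_plane(code_position)
    return group != 0 or plane >= 0x11


# U+10FFFF is the last position in Plane 10 (hex) of Group 00.
print(group_and_plane(0x10FFFF), reserved_by_clause_9_1(0x10FFFF))
print(group_and_plane(0x110000), reserved_by_clause_9_1(0x110000))
```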

The note, while of course not a normative part of the text,
states pretty clearly the intent of WG2 not to encode past
U+10FFFF. That doesn't constitute a formal "policy" per se,
because neither WG2 nor SC2 has a mechanism for publishing
formal policies (only JTC-1 can do so, in the JTC-1 Guidelines,
and those don't apply to technical content). However, WG2
has a standing "Principles and Procedures" document that
it uses to guide its technical work on 10646, and the intent
not to encode past U+10FFFF is reiterated there, as well,
for the benefit of the national bodies participating in WG2.

But what I was referring to before was the definition of UTF-8,
per se, in Annex D. That specification is still done in
terms of a transformation of code positions for the entire
codespace of 10646, and so defines 1- to 6-octet forms of UTF-8.
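
For anyone who wants to see the 1- to 6-octet scheme in action, the
following is a minimal sketch of the transformation as laid out in
the RFC 2279 table (ranges up to 7F, 7FF, FFFF, 1FFFFF, 3FFFFFF and
7FFFFFFF). The name encode_utf8_full is hypothetical, and the sketch
deliberately does no checking for surrogates or reserved positions;
it only shows the bit layout.

```python
# (upper bound of the range, marker bits of the lead octet),
# indexed by the number of trailing 10xxxxxx octets.
_FORMS = [
    (0x7F, 0x00),        # 1 octet:  0xxxxxxx
    (0x7FF, 0xC0),       # 2 octets: 110xxxxx
    (0xFFFF, 0xE0),      # 3 octets: 1110xxxx
    (0x1FFFFF, 0xF0),    # 4 octets: 11110xxx
    (0x3FFFFFF, 0xF8),   # 5 octets: 111110xx
    (0x7FFFFFFF, 0xFC),  # 6 octets: 1111110x
]


def encode_utf8_full(code_position: int) -> bytes:
    """Encode one code position using the full 1- to 6-octet scheme."""
    if not 0 <= code_position <= 0x7FFFFFFF:
        raise ValueError("outside the 10646 codespace (0..7FFFFFFF)")
    for n_trailing, (limit, lead_marker) in enumerate(_FORMS):
        if code_position <= limit:
            break
    octets = []
    value = code_position
    for _ in range(n_trailing):
        octets.append(0x80 | (value & 0x3F))  # low 6 bits per trailing octet
        value >>= 6
    octets.append(lead_marker | value)        # remaining bits go in the lead
    return bytes(reversed(octets))


for cp in (0x41, 0x10FFFF, 0x110000, 0x7FFFFFFF):
    print(f"{cp:08X} ->", encode_utf8_full(cp).hex(" ").upper())
```

The 4-octet form can carry values up to 1FFFFF, so U+110000 still
comes out as four octets; the 5- and 6-octet forms only start at
200000.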

Changing that to formally constrain it to 1- to 4-octet forms
applicable only to U+0000..U+10FFFF, as the Unicode Standard
does, was a lower priority for amending 10646, since in effect
the unused sequences past F4 8F BF BF "do no harm". No one uses
them, no one *can* use them, because they refer to code positions
that are reserved and that "shall not be used for any other
purpose".
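
As a small illustration of what "past F4 8F BF BF" means in practice:
RFC 2279 notes that byte-value ordering of UTF-8 matches ordering by
character number, so for a single well-formed, shortest-form sequence
a plain byte comparison is enough. This is only a sketch; the names
are made up, and it assumes the input has already been validated as
one well-formed sequence.

```python
LAST_USABLE = bytes.fromhex("F48FBFBF")  # UTF-8 form of U+10FFFF


def past_last_usable(sequence: bytes) -> bool:
    """True if a single well-formed UTF-8 sequence sorts after
    F4 8F BF BF, i.e. refers to a code position beyond U+10FFFF.

    Assumption: the caller has already checked well-formedness;
    this function does not validate anything itself.
    """
    return sequence > LAST_USABLE


print(past_last_usable(b"A"))                            # 1-octet form
print(past_last_usable(bytes.fromhex("F48FBFBF")))       # U+10FFFF itself
print(past_last_usable(bytes.fromhex("F4908080")))       # first 4-octet form beyond
print(past_last_usable(bytes.fromhex("FDBFBFBFBFBF")))   # U-7FFFFFFF
```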

At some point it would be nice to go back and add a little more
clarificatory text to Annex D in 10646, pointing out that the
constraints of Clause 9.1 effectively mean that all UTF-8
sequences past F4 8F BF BF shall not be used, so that in
practice UTF-8 is a 1- to 4-octet transform. Note also that
UTF-8 is described in Annex D as "an alternative coded
representation form for all of the characters of the UCS". Since
Clause 9.1 reserves Planes 11 to FF (hex), so that no characters
will be encoded past Plane 10, this already implies that UTF-8
will not be used as an alternative representation form for
anything past Plane 10. But it is easy to miss this point, since
the tables spell out everything for all the code positions to
U-7FFFFFFF.

> 
> >So I see no good reason
> >right now to put RFC 2279 out of synch with 10646, particularly
> >if it would slow down a revision of RFC 2279 now.
> 
> I think the new document should clearly state that codepoints above
> U+10FFFF cannot be encoded in UTF-16, that the Unicode consortium
> won't allocate any codepoints above that, that ISO has some relevant
> policy (if they do),... 

See citation above, which is as close as you will get to a "policy".

--Ken

Received on Friday, 12 April 2002 12:27:01 UTC