UTF-8 or UNICODE-X-X-UTF-8? (was Re: Accept-Charset support)

At 10:16 10-12-96 -0800, Erik van der Poel wrote:
>RFC 2044 explicitly refers to Unicode 1.1. Martin seems to think that
>the "consensus" is that "utf-8" refers to Unicode 2.0. However, any
>perceived consensus is useless unless it's documented.

RFC 2044 was written, and even approved, before the publication of Unicode
2.0 (which was expected and anticipated, though).  The publication of RFC
2044 long after that of Unicode 2.0 is entirely due to delays in the
publication process itself.  But I agree that the RFC could be (much)
clearer regarding the intended meaning of the "UTF-8" tag with respect to
Unicode versions.  Mea maxima culpa.

>Also, there may not be much extant Unicode 1.1 data on the net, but
>there are installed copies of software that assume 1.1. E.g. Netscape
>Navigator 3.0, which uses Unicode 1.1 conversions for KS C 5601 in Java.

A fair number of installed copies, indeed, but I fail to see why this is a
problem if there is no *Korean Hangul* 1.1 data out there to break things
(other 1.1 data doesn't count; it's upward compatible).

If you are afraid there might be (or might appear) some such data, go ahead
and register "UNICODE-1-1-UTF-8" for that purpose, but please make sure to
leave "UTF-8" for Unicode 2.0 and above.  It would be a pity to register
"UNICODE-2-0-UTF-8" when it's 1.1 data that needs to be distinguished.

>Did they also pledge to refrain from re-using the codepoints U+3400 to
>U+3D2D in the future?

Not at all, BMP space is in great demand!  However, I've heard (hearsay, not
first hand) that some ISO folks think that reallocation of this area should
be delayed as long as possible.  Given the current pace of allocation and
the existence of about 20K other free code points, that buys some time for a
smooth transition.
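
To illustrate how one might look for such data, here is a minimal Python
sketch that flags code points in the U+3400 to U+3D2D range quoted above.
The function and constant names are hypothetical, and the test is only a
heuristic: a hit means the text uses the area that held Hangul in 1.1, not
that it is certainly 1.1 data (the area may be reallocated some day).

```python
# Range quoted in this thread: where Unicode 1.1 placed the Hangul syllables.
# Names below are illustrative assumptions, not part of any standard API.
OLD_HANGUL_FIRST = 0x3400
OLD_HANGUL_LAST = 0x3D2D


def uses_old_hangul_area(text: str) -> bool:
    """Return True if any code point of `text` lies in U+3400..U+3D2D."""
    return any(OLD_HANGUL_FIRST <= ord(ch) <= OLD_HANGUL_LAST for ch in text)


if __name__ == "__main__":
    # U+3400 is the first code point of the range; U+AC00 is where
    # Unicode 2.0 starts its Hangul syllables block.
    print(uses_old_hangul_area("\u3400"))
    print(uses_old_hangul_area("\uac00"))
```

Data that never touches this area reads the same under 1.1 and 2.0, which
is why only Korean Hangul is at issue.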

>David Goldsmith's spec allowed for a very regular naming convention,
>which could be recognized/parsed. The client implementation could
>recognize UNICODE-3-0-UTF-8 as being a new version of Unicode.

True, it's feasible, but the convention itself cannot be registered,
registration of each new Unicode version is not automatic, and it locks you
forever into a parsing/guessing game on labels that should be tokens, with
no clear benefit.  By contrast, registering "UNICODE-1-1-UTF-8" now solves
the immediate problem with Korean (if really needed), spares you from
implementing a parser for charset labels, and lets you forget the whole
issue, at least until the committees screw up again :-(
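
The difference between the two approaches can be sketched in a few lines
of Python.  This is only an illustration under stated assumptions: the
table contents, the function names and the regular expression are all
hypothetical, and "UNICODE-1-1-UTF-8" appears only as the label proposed
above.

```python
import re

# Token approach: each label is an opaque key in a fixed table.
# The descriptions are illustrative, not registry text.
KNOWN_CHARSETS = {
    "utf-8": "UTF-8, Unicode 2.0 and above",
    "unicode-1-1-utf-8": "UTF-8, Unicode 1.1",
}


def lookup_charset(label: str):
    """Exact, case-insensitive match; unknown labels return None."""
    return KNOWN_CHARSETS.get(label.strip().lower())


# Parsing approach: guess a version out of the label's structure.
# The pattern is an assumption about the naming convention.
VERSIONED = re.compile(r"^unicode-(\d+)-(\d+)-utf-8$", re.IGNORECASE)


def guess_version(label: str):
    """Return (major, minor) if the label fits the pattern, else None."""
    m = VERSIONED.match(label.strip())
    return (int(m.group(1)), int(m.group(2))) if m else None
```

With the table, an unregistered label such as "UNICODE-3-0-UTF-8" is simply
unknown; with the parser, the client accepts it and must then guess what a
version it has never seen implies.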

>> The
>> registration of "UTF-8" is a bet that the relevant committees will stick to
>> their word.
>
>Hey, you're betting using *my* money! :-) Just kidding.

;-)

-- 
François Yergeau <yergeau@alis.com>
Alis Technologies Inc., Montréal
Tél : +1 (514) 747-2547
Fax : +1 (514) 747-2561

Received on Tuesday, 10 December 1996 21:24:11 UTC