Re: revised "generic syntax" internet draft



Date: Fri, 25 Apr 1997 19:43:21 +0200 (MET DST)
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: John C Klensin <klensin@mci.net>
Cc: Edward Cherlin <cherlin@newbie.net>, uri@bunyip.com
Subject: Re: revised "generic syntax" internet draft
In-Reply-To: <SIMEON.9704250455.U@tp7.Jck.com>
Message-Id: <Pine.SUN.3.96.970425192649.245y-100000@enoshima>

On Fri, 25 Apr 1997, John C Klensin wrote:

> 
> On Thu, 24 Apr 1997 23:25:10 -0700 Edward Cherlin 
> <cherlin@newbie.net> wrote:
> 
> >...
> > Those with less popular character sets are out in the cold today. Unicode
> > will bring them in from the cold, since it is a general solution that has
> > fairly wide implementation (in Windows NT, Macintosh OS, several flavors of
> > UNIX, Java, and so on, and in applications such as Microsoft Office 97 and
> > Alis Tango Web browser).
> >...
> > There is no hope of getting every legacy character 
> encoding incorporated
> > into modern software by any means other than Unicode.
> 
> Edward,
> 
> This is not true, and these discussions seem to be 
> difficult enough without engaging in hyperbole.  In 
> particular:
> 
> (i) However widely Unicode is implemented, the actual use 
> patterns are, especially outside of areas that use 
> Latin-based alphabets, much larger for systems based on 
> character set (or code page) switching (mostly, but not 
> entirely, utilizing ISO 2022 designators) than they are for 
> Unicode.

Actual use patterns are changing rapidly. As of today, the
two main new PC word-processing products (MS Word 97 and
Ichitaro 8) use Unicode internally.
And the decision to base URL internationalization on Unicode
is not based on counting, but on evaluating technical and
user merit. If you can show me how to realize internationalized
URLs better with ISO 2022 than with UTF-8, please go ahead.


> (ii) In many cases (including with some applications that 
> end up on the system you have mentioned), Unicode (or 
> something like it) is used as an internal representation, 
> but what goes over the wire is a character set switching 
> (or shifting) system.   There is some small (but not zero) 
> risk that Unicode, like X.400, will end up being more of a 
> common conversion and representation format than a format 
> that end-user applications actually use natively.

Even if Unicode should remain a "common conversion and
representation format", that would be fine, because it
is exactly a "common conversion and representation format"
that we need.


> (iii) Even if "Unicode" is the right solution, it doesn't 
> automatically follow that the best representation is UTF-8. 
> A case can be made that, if one is going to have to resort 
> to hex encoding anyway, simply hex-encoding the UCS-2 
> string will give better behavior more of the time (when 
> "more" is considered by weighting the use of different 
> ranges of the Unicode coding set by the number of people in 
> the world who use those characters).

And weighted by the frequency with which all these users use
ASCII for not (yet) internationalized schemes or parts of URLs,
and for URLs that they want to be usable worldwide!
And add the point that neither our US friends nor anybody
else wants to start using %HH%HH (or even %HH%HH%HH%HH if
you consider UCS-4) for every ASCII character.
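The tradeoff can be made concrete with a short sketch (modern Python, not part of the original mail; the per-code-unit UCS-2 escaping scheme below is a hypothetical illustration, not a proposed standard):

```python
# Compare UTF-8 %HH escaping against escaping every UCS-2 code unit.
from urllib.parse import quote

def escape_utf8(s):
    # Escape only non-ASCII-safe characters; plain ASCII passes through.
    return quote(s, safe="/")

def escape_ucs2_hex(s):
    # Hypothetical scheme: escape *every* character as its two UCS-2 bytes.
    return "".join("%%%02X%%%02X" % (ord(c) >> 8, ord(c) & 0xFF) for c in s)

path = "/docs/r\u00e9sum\u00e9.html"  # mostly ASCII, two accented letters

print(escape_utf8(path))      # only the accented letters expand
print(escape_ucs2_hex(path))  # every character, ASCII included, expands
```

For mostly-ASCII URLs, UTF-8 escaping leaves the ASCII part untouched, while per-code-unit UCS-2 hex escaping doubles (or, with UCS-4, quadruples) every character.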


> (iv) It is not hard to demonstrate that, in the medium to 
> long term, there are some requirements for character set 
> encoding for which Unicode will not suffice and it will be 
> necessary to go to multi-plane 10646

You are not the first or only one to notice this. Unicode
currently can encode planes 0 to 16 (for a total of about
one million codepoints) by a mechanism called surrogates
or UTF-16. Please check your copy of Unicode vol. 2.
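The surrogate arithmetic can be sketched in a few lines (modern Python, not from the original mail; U+10400 is merely an assumed example code point). Planes 0 to 16 give 17 x 65536 = 1,114,112 code points, hence "about one million":

```python
# The UTF-16 surrogate mechanism: a code point beyond the BMP is split
# into a high and a low surrogate, giving access to planes 1 through 16.
def to_surrogates(cp):
    assert 0x10000 <= cp <= 0x10FFFF, "only supplementary-plane code points"
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)    # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)   # bottom 10 bits -> low surrogate
    return high, low

print([hex(u) for u in to_surrogates(0x10400)])
```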

> (which is one of 
> several reasons why IETF recommendation documents have 
> fairly consistently pointed to 10646 and not Unicode).  The 
> two are not the same.  In particular, while the comment in 
> (iii) can easily and correctly be rewritten as a UCS-4 
> statement, UTF-8 becomes, IMO, pathological (and its own 
> excuse for compression) when one starts dealing with plane 
> 3 or 4 much less, should we be unlucky enough to get there, 
> plane 200 or so.

Currently, plane 2 is tentatively planned for rare CJKV
ideographs, and plane 1 is planned for all kinds of other
rare, historical, and obscure scripts. For a text in Egyptian
hieroglyphs or in Klingon, you might have to use four
bytes per character in a UTF-8 representation :-).
One should also mention that there is still considerable
space in the BMP, and that this space is carefully planned
and administered with usage frequency in mind.
For details about all this, please see
    http://www.indigo.ie/egt/standards/iso10646/bmp-roadmap.html
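A quick sketch (modern Python, not part of the original mail) of how the UTF-8 length of a character grows with its code point range:

```python
# UTF-8 needs 1 byte for ASCII, up to 3 bytes within the BMP, and
# 4 bytes for characters on planes 1 and above.
samples = [
    ("ASCII 'A' (U+0041)",    "\u0041"),
    ("Latin 'e-acute' (U+00E9)", "\u00e9"),
    ("CJK ideograph (U+6F22)", "\u6f22"),
    ("Plane 1 (U+10400)",     "\U00010400"),
]
for name, ch in samples:
    print(name, "->", len(ch.encode("utf-8")), "bytes")
```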

> p.s. I haven't changed my mind -- I still don't like 2022 
> as a "character set" or as a data representation, largely 
> because I don't like stateful character encodings.  But I 
> think we need to make decisions based on reality rather 
> than wishful thinking, evangelism, or pure optimism.

If the above are all your concerns against UTF-8, I can
happily conclude that we stay safely on the reality side
with UTF-8 :-).
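The statefulness objection to ISO 2022 mentioned in the p.s. can be demonstrated concretely (modern Python sketch, not part of the original mail):

```python
# ISO-2022-JP is stateful: the same byte values mean different characters
# depending on which escape sequence was last seen, so a fragment of the
# byte stream cannot be interpreted without that earlier state.
text = "abc\u65e5\u672c\u8a9eabc"    # ASCII, then three kanji, then ASCII
data = text.encode("iso2022_jp")
print(data)                          # ESC sequences switch charsets mid-stream

# The kanji segment between the escapes is indistinguishable from ASCII:
start = data.index(b"\x1b$B") + 3    # ESC $ B : switch to JIS X 0208
end = data.index(b"\x1b(B")          # ESC ( B : switch back to ASCII
print(data[start:end].decode("ascii"))  # same bytes, read as plain ASCII
```

A stateless encoding such as UTF-8 avoids this: any byte sequence decodes the same way regardless of what preceded it, which matters for URLs that are cut, pasted, and parsed piecemeal.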

Regards,	Martin.