Widetext (was Re: Registration of new charset "UTF-16") from Chris Newman on 1998-05-15 (ietf-charsets@w3.org from April to June 1998)

From: Chris Newman <Chris.Newman@INNOSOFT.COM>
Date: Fri, 15 May 1998 10:18:52 -0700 (PDT)
To: "Martin J. Duerst" <duerst@w3.org>
Cc: MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>, ietf-charsets@ISI.EDU, murata@fxis.fujixerox.co.jp, Tatsuo_Kobayashi@justsystem.co.jp
Message-id: <Pine.SOL.3.95.980515094324.594I-100000@elwood.innosoft.com>

On Fri, 15 May 1998, Martin J. Duerst wrote:
> At 12:08 98/05/14 -0700, Chris Newman wrote:
> > We might eventually define a MIME "widetext" top-level media type for
> > plaintext data using UTF-16 or UCS-4, but I don't think it's time to do
> > that yet.  UTF-8 is standards track and may be freely used in text/* media
> > types.
> 
> Why not? One problem is to find a good name for it, and you just gave
> one above, there may be others. For the rest, it's pretty easy. Put
> together stuff from HTTP1.1 and from the MIME RFCs.

I honestly don't think we have nearly enough experience using Unicode on
the Internet for interoperable internationalization.  There are lots of
issues relating to canonicalization, whitespace, line-ending characters
and other things in Unicode which make me very nervous from an
interoperability standpoint.  I think it's premature to start sending
around UTF-16 because it takes all the Unicode-related problems and
compounds them by adding a slew of binary and endian-related problems
(which have been known to cause trouble in the past). 
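
As a rough illustration of the endian concern above (a sketch only; the
sample string and variable names are assumptions, not anything from this
thread), the same two characters serialize to one byte sequence in UTF-8
but to two different sequences in UTF-16, and reading one byte order as
the other silently yields different characters:

```python
# Sketch: UTF-8 has a single byte order; UTF-16 has two.
text = "A\u00e9"  # 'A' followed by e-acute (sample text, chosen arbitrarily)

utf8 = text.encode("utf-8")          # one possible serialization
utf16_be = text.encode("utf-16-be")  # big-endian, no byte order mark
utf16_le = text.encode("utf-16-le")  # little-endian, no byte order mark

# UTF-16 output contains NUL bytes, which text-oriented software may mishandle.
has_nul = b"\x00" in utf16_be

# Decoding big-endian bytes as little-endian raises no error here;
# it just produces unrelated code points.
misread = utf16_be.decode("utf-16-le")

print(utf8, utf16_be, utf16_le, has_nul, [hex(ord(c)) for c in misread])
```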

I believe everyone should use UTF-8 for now.  Once we've got the
Unicode-related problems ironed out, we can start worrying about the
binary, backwards-compatibility and endian-related problems UTF-16 will
cause.  Ultimately, I want interoperable international characters to
become reality, but the more potholes there are on the road today, the
more likely people are to turn away.

There are other things we could do when we deploy a widetext/* top-level
media type.  We might want to also deploy a compressing
content-transfer-encoding at the same time and prefer UCS-4 over UTF-16 --
we might even be able to skip the UTF-16 step altogether at least for
transmission over the Internet.  That would be one less interoperability
problem.  Unicode's "Line Separator" and "Paragraph Separator" codepoints
might just work, so in widetext/* we might want to mandate their use
instead of CRLF so we really have a canonical cross-platform plain text
format.  I have no idea if any of this will work and I don't think we have
the experience we need to do it right.
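
To make the line-ending idea concrete, here is a minimal sketch of what
canonicalizing to the Unicode "Line Separator" (U+2028) could look like.
The function name and the choice to map every platform line ending to
U+2028 are assumptions for illustration; nothing here is a defined
widetext/* rule:

```python
import re

LINE_SEPARATOR = "\u2028"       # Unicode LINE SEPARATOR
PARAGRAPH_SEPARATOR = "\u2029"  # Unicode PARAGRAPH SEPARATOR (not used below)

def to_unicode_separators(text: str) -> str:
    """Map CRLF, bare CR, and bare LF to U+2028 (hypothetical canonical form)."""
    # CRLF is listed first so it is replaced as a unit, not as CR then LF.
    return re.sub(r"\r\n|\r|\n", LINE_SEPARATOR, text)

# The same two-line text as written on three different platforms.
samples = ["one\r\ntwo", "one\ntwo", "one\rtwo"]
canonical = {to_unicode_separators(s) for s in samples}

# All three variants collapse to a single representation.
print(len(canonical))
```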

> Make it so that
> widetext/* in the HTTP MIME derivative is equivalent to text/*. The
> sooner we do it, the sooner we get rid of the problems with interchange
> between HTTP-delivered content and other protocols, and the sooner we
> can have full internationalization in email. Email UA implementors
> won't have much work on this one, but they have to know what to do.

I think it's best just to use UTF-8 in email for now.  There is going to
be *a lot* of real world opposition to deploying ISO-10646/Unicode in
email.  I expect UTF-7 to be reviled as much as quoted-printable or RFC
2047, and it was a mistake to promote it.  UTF-16 will be an unreadable
blob to most email recipients; anyone sending it would rightfully be
flamed.  I don't want UTF-8/ISO-10646 to lose to our current plethora of
character sets in the flurry of opposition which UTF-7 and UTF-16 will
create.

We have one standards track character set, UTF-8, which will cause the
least pain to deploy.  Let's promote that until it works, then worry about
saving bytes in the encoding.  I'm aware this is a bit harsh for our
friends with ideographic characters, but I think the outcome will be
better in the long term.
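
The byte-count trade-off mentioned above can be checked with a short
sketch. The sample strings are arbitrary assumptions; the point is only
that UTF-8 is smaller for ASCII text and larger than UTF-16 for
ideographic text, while UCS-4 (encoded here as UTF-32) is largest in
both cases:

```python
# Sketch: encoded size in bytes for ASCII and ideographic sample text.
samples = {
    "ascii": "hello",
    "ideographic": "\u65e5\u672c\u8a9e",  # three CJK ideographs
}

# Explicit big-endian codecs are used so no byte order mark is counted.
encodings = ["utf-8", "utf-16-be", "utf-32-be"]

sizes = {
    name: {enc: len(text.encode(enc)) for enc in encodings}
    for name, text in samples.items()
}

for name, per_encoding in sizes.items():
    print(name, per_encoding)
```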

		- Chris


Received on Friday, 15 May 1998 11:04:35 UTC