Unicode sucks, get over it (Re: Delta Compression and UTF-8 Header Values) from Nico Williams on 2013-02-10 (ietf-http-wg@w3.org from January to March 2013)

From: Nico Williams <nico@cryptonector.com>
Date: Sun, 10 Feb 2013 16:45:01 -0600
To: Roberto Peon <grmocg@gmail.com>
Cc: Poul-Henning Kamp <phk@phk.freebsd.dk>, Julian Reschke <julian.reschke@gmx.de>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, James M Snell <jasnell@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Message-ID: <CAK3OfOgYi-=W_QGJywf3hQbFMkfWv-ceXiJbYEdWM3-iaefP4Q@mail.gmail.com>

On Sun, Feb 10, 2013 at 3:04 PM, Roberto Peon <grmocg@gmail.com> wrote:
> Another place where we may need to know about normalization is for caching.
> Does the lookup, etc. occur on the normalized form, or on the given data?
>
> All in all, utf-8 without addendum sucks for protocol work.

Normalization is not a UTF-8 thing, it's a Unicode thing, and it's not
really a Unicode thing either, but a result of our stupid, human
scripts and their stupid collation and other rules.
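
A minimal Python sketch of the point above, using only the standard
`unicodedata` module. The helper name `same_text` is hypothetical, and
NFC is just one possible choice of normalization form. The two
spellings of "é" differ at the code point level, so they differ in
every encoding, not just in UTF-8:

```python
import unicodedata

# Two spellings of the same user-visible string "é".
composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The mismatch exists before any encoding is chosen...
print(composed == decomposed)                                          # False
# ...and therefore shows up in UTF-8 and UTF-16 alike.
print(composed.encode("utf-8") == decomposed.encode("utf-8"))          # False
print(composed.encode("utf-16-le") == decomposed.encode("utf-16-le"))  # False
# Only normalization makes the two compare equal.
print(same_text(composed, decomposed))                                 # True
```

This is also why the caching question quoted above (lookup on the
normalized form or on the given data) would come up with any Unicode
encoding.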

There is *nothing* we can do when dealing with text that would both
a) meet the needs of our users, and b) not suck for string
comparison, collation, and other such operations.

In other words: all these arguments about how it sucks to deal with
UTF-8 or Unicode are not useful arguments.  We have to deal with text
in at least some parts of our protocols, and that means we have to
deal with I18N.

Worse, much worse than the problems Unicode brings with it, are the
problems of having either no clue what codeset some text is in
(interop failures result), or having to support many, many codesets
(trade one set of complexities for a bigger one).  Clearly it is
better to just use Unicode for text in Internet protocols.
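
As an illustration of the "no clue what codeset" failure mode, here is
a small Python sketch. The scenario (a sender using UTF-8 and a
receiver guessing ISO-8859-1) is an assumed example, not something
taken from a particular implementation:

```python
name = "Dürst"

sent = name.encode("utf-8")           # sender encodes as UTF-8
misread = sent.decode("iso-8859-1")   # receiver wrongly assumes ISO-8859-1

# The decode "succeeds" (every byte is valid ISO-8859-1), so nothing
# signals the error; the text is just silently wrong.
print(misread == name)            # False
print(len(name), len(misread))    # 5 6
```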

It is also clear that we can't really have a decent one-size-fits-all
static Huffman coding table for text that may be written in any of
tens of scripts spanning ~100k codepoints.  Now, perhaps we can
encourage the world to use URIs not IRIs and so on, but really, that
would be a step backwards.

My proposal:

 - All text values in HTTP/2.0 that are also present in HTTP/1.1
should be sent as either UTF-8 or ISO-8859-1, with a one-bit tag to
indicate which it is.

   This pushes re-encoding to the ends, but it lets middle boxes
re-encode as well where they want or need to, and it gives us a nice
upgrade path.

 - All text values in HTTP/2.0 that are NOT also present in HTTP/1.1
should be sent *only* as UTF-8.
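
A rough Python sketch of how the two rules above could look in code.
Everything concrete here is an assumption made for illustration: the
tag values (0 for ISO-8859-1, 1 for UTF-8), the function names, and
the representation of a value as a (tag, bytes) pair. The proposal
itself only says "a one-bit tag".

```python
from typing import Tuple

# Assumed tag values; the proposal does not assign them.
TAG_LATIN1 = 0
TAG_UTF8 = 1


def decode_value(tag: int, raw: bytes) -> str:
    """Decode a tagged value that also exists in HTTP/1.1."""
    codec = "utf-8" if tag == TAG_UTF8 else "iso-8859-1"
    return raw.decode(codec)


def reencode_to_utf8(tag: int, raw: bytes) -> Tuple[int, bytes]:
    """What a middle box could do if it wants UTF-8 everywhere."""
    if tag == TAG_UTF8:
        return tag, raw
    return TAG_UTF8, raw.decode("iso-8859-1").encode("utf-8")


def encode_new_value(text: str) -> bytes:
    """Values with no HTTP/1.1 counterpart: UTF-8 only, so no tag is needed."""
    return text.encode("utf-8")


legacy = (TAG_LATIN1, b"caf\xe9")        # "café" as ISO-8859-1
print(decode_value(*legacy))             # café
print(reencode_to_utf8(*legacy))         # (1, b'caf\xc3\xa9')
```

Because every byte sequence is valid ISO-8859-1, the tag is the only
thing that lets a receiver or middle box choose the right decoder
without guessing.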

Why UTF-8 and not some other encoding of Unicode?  Because I don't see
how UTF-16 or UTF-32 could help us here.  Other encodings seem even
less likely to be useful: sure, Punycode would be all ASCII, but it
wouldn't actually cause static Huffman coding to be useful.  At best
one can argue that UTF-8 penalizes some scripts with a 50% or 100%
expansion relative to script-specific codesets, so that we should
prefer UTF-16 or -32 for fairness reasons; let's not.
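
The 50% and 100% figures can be checked with a short Python sketch.
The sample strings and the choice of KOI8-R and Shift_JIS as the
script-specific codesets are assumptions picked for illustration, and
the helper name `expansion` is hypothetical:

```python
def expansion(text: str, baseline_codec: str, other_codec: str = "utf-8") -> float:
    """Percent size increase of other_codec relative to baseline_codec."""
    baseline = len(text.encode(baseline_codec))
    other = len(text.encode(other_codec))
    return (other - baseline) / baseline * 100


# Cyrillic: 1 byte per letter in KOI8-R, 2 bytes in UTF-8.
print(expansion("Привет", "koi8_r"))                     # 100.0
# Japanese: 2 bytes per character in Shift_JIS, 3 bytes in UTF-8.
print(expansion("日本語", "shift_jis"))                   # 50.0
# The flip side: ASCII header text doubles in size under UTF-16.
print(expansion("Content-Type", "ascii", "utf-16-le"))   # 100.0
```

The last line shows the cost of moving to UTF-16 for text that is
mostly ASCII, which most header names and values are.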

Nico
--

Received on Sunday, 10 February 2013 22:45:26 UTC