Subject: Re: Unicode sucks, get over it (Re: Delta Compression and UTF-8 Header Values)

From: Julian Reschke <julian.reschke@gmx.de>
Date: Mon, 11 Feb 2013 09:35:45 +0100
To: Nico Williams <nico@cryptonector.com>
CC: Roberto Peon <grmocg@gmail.com>, Poul-Henning Kamp <phk@phk.freebsd.dk>, "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, James M Snell <jasnell@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Message-ID: <5118AD61.6030003@gmx.de>

On 2013-02-10 23:45, Nico Williams wrote:
> On Sun, Feb 10, 2013 at 3:04 PM, Roberto Peon <grmocg@gmail.com> wrote:
>> Another place where we may need to know about normalization is for caching.
>> Does the lookup, etc. occur on the normalized form, or on the given data?
>>
>> All in all, utf-8 without addendum sucks for protocol work.
>
> Normalization is not a UTF-8 thing, it's a Unicode thing, and it's not
> really a Unicode thing either, but a result of our stupid, human
> scripts and their stupid collation and other rules.
>
> There is *nothing* that we can do for dealing with text that would do
> both of: a) meet the needs of our users, and b) not suck for string
> comparison, collation, and other such operations.
>
> In other words: all these arguments about how it sucks to deal with
> UTF-8 or Unicode are not useful arguments.  We have to deal with text
> in at least some parts of our protocols, and that means we have to
> deal with I18N.
>
> Worse, much worse than the problems Unicode brings with it, are the
> problems of having either no clue what codeset some text is in
> (interop failures result), or having to support many, many codesets
> (trade one set of complexities for a bigger one).  Clearly it is
> better to just use Unicode for text in Internet protocols.
>
> It is also clear that we can't really have a decent one-size-fits-all
> static Huffman coding table for text that may be written in any of
> tens of scripts spanning ~100k codepoints.  Now, perhaps we can
> encourage the world to use URIs not IRIs and so on, but really, that
> would be a step backwards.
>
> My proposal:
>
>   - All text values in HTTP/2.0 that are also present in HTTP/1.1
> should be sent as either UTF-8 or ISO8859-1, with a one-bit tag to
> indicate which it is.
> ...

Why do we need two options?

Best regards, Julian

Received on Monday, 11 February 2013 08:36:17 UTC
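
Illustration of the caching question quoted above (whether lookup happens on the normalized form or on the data as given): a minimal sketch, not part of the original message, showing how the same visible text can yield different UTF-8 bytes depending on its normalization form. The helper names `cache_key_raw` and `cache_key_nfc` are hypothetical, and choosing NFC is an assumption; the thread does not settle on any normalization form.

```python
import unicodedata


def cache_key_raw(value: str) -> bytes:
    """Hypothetical cache key: the UTF-8 bytes exactly as given."""
    return value.encode("utf-8")


def cache_key_nfc(value: str) -> bytes:
    """Hypothetical cache key: normalize to NFC first (an assumed choice)."""
    return unicodedata.normalize("NFC", value).encode("utf-8")


# The same name written two ways:
composed = "D\u00fcrst"     # precomposed U+00FC (u with diaeresis)
decomposed = "Du\u0308rst"  # "u" followed by U+0308 (combining diaeresis)

# Raw bytes differ, so a byte-wise lookup treats these as different keys.
print(cache_key_raw(composed) == cache_key_raw(decomposed))  # False

# After normalization the keys agree.
print(cache_key_nfc(composed) == cache_key_nfc(decomposed))  # True
```

A byte-wise comparison is cheap but treats the two spellings as distinct; normalizing first makes them match, at the cost of every cache and intermediary having to agree on the form.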