Re: Unicode sucks, get over it (Re: Delta Compression and UTF-8 Header Values)

+1

A lot of the 'security issues' people raise relating to homographs and
the like turn out to be complete busts in practice.

URIs have always been case sensitive. Why should the encoding of accents
not be significant as well?

Yes, someone might confuse two similar glyphs from different character
sets. But what is the probability that someone would type the weird
encoding in at the keyboard? I think it's zero.
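
To be concrete about what is (and is not) being confused here: a
homograph pair differs at the code point level even though it renders
identically, and the same accented character can be encoded two
different ways. A quick Python sketch, for illustration only:

    import unicodedata

    # Visually identical glyphs, different code points: nobody types
    # the Cyrillic one at a keyboard by accident.
    latin = "a"          # U+0061 LATIN SMALL LETTER A
    cyrillic = "\u0430"  # U+0430 CYRILLIC SMALL LETTER A
    print(latin == cyrillic)           # False
    print(unicodedata.name(latin))     # LATIN SMALL LETTER A
    print(unicodedata.name(cyrillic))  # CYRILLIC SMALL LETTER A

    # The accent case: one character, two encodings.
    nfc = "\u00e9"   # "é" precomposed
    nfd = "e\u0301"  # "e" + U+0301 COMBINING ACUTE ACCENT
    print(nfc == nfd)                                # False
    print(unicodedata.normalize("NFC", nfd) == nfc)  # True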

Sure, you can come up with corner cases, and there are some security
issues. But I can't see an HTTP issue here, still less a reason to be
concerned about encoding.

On Sun, Feb 10, 2013 at 5:45 PM, Nico Williams <nico@cryptonector.com> wrote:

> On Sun, Feb 10, 2013 at 3:04 PM, Roberto Peon <grmocg@gmail.com> wrote:
> > Another place where we may need to know about normalization is for
> caching.
> > Does the lookup, etc. occur on the normalized form, or on the given data?
> >
> > All in all, utf-8 without addendum sucks for protocol work.
>
> Normalization is not a UTF-8 thing, it's a Unicode thing, and it's not
> really a Unicode thing either, but a result of our stupid, human
> scripts and their stupid collation and other rules.
>
> There is *nothing* that we can do for dealing with text that would do
> both of: a) meet the needs of our users, and b) not suck for string
> comparison, collation, and other such operations.
>
> In other words: all these arguments about how it sucks to deal with
> UTF-8 or Unicode are not useful arguments.  We have to deal with text
> in at least some parts of our protocols, and that means we have to
> deal with I18N.
>
> Worse, much worse than the problems Unicode brings with it, are the
> problems of having either no clue what codeset some text is in
> (interop failures result), or having to support many, many codesets
> (trade one set of complexities for a bigger one).  Clearly it is
> better to just use Unicode for text in Internet protocols.
>
> It is also clear that we can't really have a decent one-size-fits-all
> static Huffman coding table for text that may be written in any of
> tens of scripts spanning ~100k codepoints.  Now, perhaps we can
> encourage the world to use URIs not IRIs and so on, but really, that
> would be a step backwards.
>
> My proposal:
>
>  - All text values in HTTP/2.0 that are also present in HTTP/1.1
> should be sent as either UTF-8 or ISO8859-1, with a one-bit tag to
> indicate which it is.
>
>    This pushes re-encoding to the ends, but it lets middle boxes
> re-encode as well where they want or need to, and it gives us a nice
> upgrade path.
>
>  - All text values in HTTP/2.0 that are NOT also present in HTTP/1.1
> should be sent *only* as UTF-8.
>
> Why UTF-8 and not some other encoding of Unicode?  Because I don't see
> how UTF-16 or UTF-32 could help us here.  Other encodings seem even
> less likely to be useful: sure, punycode would be all ASCII, but it
> wouldn't actually cause static Huffman coding to be useful.  At best
> one can argue that UTF-8 penalizes some scripts with a 50% or 100%
> expansion relative to script-specific codesets, so that we should
> prefer UTF-16 or -32 for fairness reasons; let's not.
>
> Nico
> --
>
>
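
For reference, Roberto's caching question is easy to demonstrate: a
cache keyed on the bytes as given misses the normalized form of the
same value. A Python sketch (the key and value are invented):

    import unicodedata

    cache = {}
    key_nfc = "caf\u00e9"   # "café", precomposed (NFC)
    key_nfd = "cafe\u0301"  # "café", decomposed (NFD)
    cache[key_nfc] = "cached response"
    print(cache.get(key_nfd))  # None: lookup on the given data misses
    print(cache.get(unicodedata.normalize("NFC", key_nfd)))  # hit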
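
Nico's one-bit tag might look something like this on the wire. This is
a hypothetical sketch, not anything specified: the function names, the
tag octet, and the choice of the low bit are all mine.

    def encode_value(text: str) -> bytes:
        # Tag octet: 0x00 = ISO 8859-1 follows, 0x01 = UTF-8 follows.
        try:
            return b"\x00" + text.encode("iso8859_1")
        except UnicodeEncodeError:
            # Not representable in ISO 8859-1; fall back to UTF-8.
            return b"\x01" + text.encode("utf-8")

    def decode_value(wire: bytes) -> str:
        return wire[1:].decode("utf-8" if wire[0] & 1 else "iso8859_1")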
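
And the expansion he concedes at the end is real but easy to quantify:

    # Greek: 1 byte/char in script-specific ISO 8859-7 vs 2 in UTF-8
    # (the 100% case).
    greek = "\u03ba\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1"  # "καλημέρα"
    print(len(greek.encode("iso8859_7")), len(greek.encode("utf-8")))  # 8 16

    # CJK: 2 bytes/char in Shift_JIS vs 3 in UTF-8 (the 50% case).
    jp = "\u65e5\u672c\u8a9e"  # "日本語"
    print(len(jp.encode("shift_jis")), len(jp.encode("utf-8")))  # 6 9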


-- 
Website: http://hallambaker.com/

Received on Monday, 11 February 2013 01:20:09 UTC