- From: Nico Williams <nico@cryptonector.com>
- Date: Sun, 10 Feb 2013 16:45:01 -0600
- To: Roberto Peon <grmocg@gmail.com>
- Cc: Poul-Henning Kamp <phk@phk.freebsd.dk>, Julian Reschke <julian.reschke@gmx.de>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, James M Snell <jasnell@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
On Sun, Feb 10, 2013 at 3:04 PM, Roberto Peon <grmocg@gmail.com> wrote:
> Another place where we may need to know about normalization is for caching.
> Does the lookup, etc. occur on the normalized form, or on the given data?
>
> All in all, utf-8 without addendum sucks for protocol work.

Normalization is not a UTF-8 thing, it's a Unicode thing, and it's not
really a Unicode thing either, but a result of our stupid, human scripts
and their stupid collation and other rules.

There is *nothing* that we can do for dealing with text that would do
both of:

 a) meet the needs of our users, and
 b) not suck for string comparison, collation, and other such operations.

In other words: all these arguments about how it sucks to deal with
UTF-8 or Unicode are not useful arguments. We have to deal with text in
at least some parts of our protocols, and that means we have to deal
with I18N.

Worse, much worse than the problems Unicode brings with it, are the
problems of having either no clue what codeset some text is in (interop
failures result), or having to support many, many codesets (trade one
set of complexities for a bigger one).

Clearly it is better to just use Unicode for text in Internet protocols.

It is also clear that we can't really have a decent one-size-fits-all
static Huffman coding table for text that may be written in any of tens
of scripts spanning ~100k codepoints.

Now, perhaps we can encourage the world to use URIs not IRIs and so on,
but really, that would be a step backwards.

My proposal:

 - All text values in HTTP/2.0 that are also present in HTTP/1.1 should
   be sent as either UTF-8 or ISO8859-1, with a one-bit tag to indicate
   which it is. This pushes re-encoding to the ends, but it lets middle
   boxes re-encode as well where they want or need to, and it gives us a
   nice upgrade path.

 - All text values in HTTP/2.0 that are NOT also present in HTTP/1.1
   should be sent *only* as UTF-8.

Why UTF-8 and not some other encoding of Unicode? Because I don't see
how UTF-16 or UTF-32 could help us here. Other encodings seem even less
likely to be useful: sure, punycode would be all ASCII, but it wouldn't
actually cause static Huffman coding to be useful. At best one can argue
that UTF-8 penalizes some scripts with a 50% or 100% expansion relative
to script-specific codesets, so that we should prefer UTF-16 or -32 for
fairness reasons; let's not.

Nico
--
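
For illustration only, and not part of the original message: a minimal
Python sketch of three points made above. It assumes hypothetical tag
values (the proposal says "one-bit tag" but assigns no values), and the
function names are invented for this example. It shows how a receiver
might decode a tagged text value, why a cache lookup on the given data
differs from one on a normalized form (NFC is chosen here only as an
example), and where the 50% and 100% expansion figures come from.

```python
import unicodedata
from typing import Tuple

# Hypothetical tag values; the proposal does not define them.
TAG_LATIN1 = 0
TAG_UTF8 = 1


def decode_text_value(tag: int, raw: bytes) -> str:
    """Decode a header text value according to its one-bit tag."""
    if tag == TAG_UTF8:
        return raw.decode("utf-8")
    if tag == TAG_LATIN1:
        return raw.decode("iso-8859-1")
    raise ValueError("tag must be 0 or 1")


def reencode_as_utf8(tag: int, raw: bytes) -> Tuple[int, bytes]:
    """What an end or a middle box could do: upgrade a value to UTF-8."""
    return TAG_UTF8, decode_text_value(tag, raw).encode("utf-8")


def cache_key(text: str) -> str:
    """Assumed policy: look up on the NFC-normalized form."""
    return unicodedata.normalize("NFC", text)


def utf8_expansion(text: str, legacy_codec: str) -> float:
    """Ratio of UTF-8 size to the size in a script-specific codeset."""
    return len(text.encode("utf-8")) / len(text.encode(legacy_codec))


if __name__ == "__main__":
    # Same text, two encodings, distinguished only by the tag.
    print(decode_text_value(TAG_LATIN1, b"caf\xe9"))
    print(decode_text_value(TAG_UTF8, b"caf\xc3\xa9"))

    # Same text to a human, different code points and bytes on the wire.
    composed = "caf\u00e9"      # e-acute as one code point
    decomposed = "cafe\u0301"   # e followed by a combining acute accent
    print(composed == decomposed)
    print(cache_key(composed) == cache_key(decomposed))

    # Greek: 1 byte per letter in ISO 8859-7, 2 bytes in UTF-8.
    print(utf8_expansion("\u03b1\u03b2\u03b3", "iso8859_7"))
    # Japanese: 2 bytes per character in Shift_JIS, 3 bytes in UTF-8.
    print(utf8_expansion("\u65e5\u672c\u8a9e", "shift_jis"))
```

Whether a cache should key on the bytes as given or on a normalized form
is exactly the open question in the quoted text; the sketch only shows
that the two choices give different answers for the same visible string.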
Received on Sunday, 10 February 2013 22:45:26 UTC