- From: Julian Reschke <julian.reschke@gmx.de>
- Date: Mon, 11 Feb 2013 09:35:45 +0100
- To: Nico Williams <nico@cryptonector.com>
- CC: Roberto Peon <grmocg@gmail.com>, Poul-Henning Kamp <phk@phk.freebsd.dk>, "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, James M Snell <jasnell@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
On 2013-02-10 23:45, Nico Williams wrote:
> On Sun, Feb 10, 2013 at 3:04 PM, Roberto Peon <grmocg@gmail.com> wrote:
>> Another place where we may need to know about normalization is for caching.
>> Does the lookup, etc. occur on the normalized form, or on the given data?
>>
>> All in all, utf-8 without addendum sucks for protocol work.
>
> Normalization is not a UTF-8 thing, it's a Unicode thing, and it's not
> really a Unicode thing either, but a result of our stupid, human
> scripts and their stupid collation and other rules.
>
> There is *nothing* that we can do for dealing with text that would do
> both of: a) meet the needs of our users, and b) not suck for string
> comparison, collation, and other such operations.
>
> In other words: all these arguments about how it sucks to deal with
> UTF-8 or Unicode are not useful arguments. We have to deal with text
> in at least some parts of our protocols, and that means we have to
> deal with I18N.
>
> Worse, much worse than the problems Unicode brings with it, are the
> problems of having either no clue what codeset some text is in
> (interop failures result), or having to support many, many codesets
> (trade one set of complexities for a bigger one). Clearly it is
> better to just use Unicode for text in Internet protocols.
>
> It is also clear that we can't really have a decent one-size-fits-all
> static Huffman coding table for text that may be written in any of
> tens of scripts spanning ~100k codepoints. Now, perhaps we can
> encourage the world to use URIs not IRIs and so on, but really, that
> would be a step backwards.
>
> My proposal:
>
> - All text values in HTTP/2.0 that are also present in HTTP/1.1
>   should be sent as either UTF-8 or ISO8859-1, with a one-bit tag to
>   indicate which it is.
> ...

Why do we need two options?

Best regards, Julian
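
[Illustration, not part of the original message.] The caching question quoted above (does the lookup happen on the normalized form or on the given data?) can be made concrete with a small sketch. It assumes a cache that keys on the raw UTF-8 bytes of a text value; the helper name `cache_key` and the choice of NFC are hypothetical and are not something proposed in the thread.

```python
import unicodedata

# Two spellings of the same visible text "café":
composed = "caf\u00e9"     # "é" as the single code point U+00E9
decomposed = "cafe\u0301"  # "e" followed by combining acute accent U+0301


def cache_key(value: str) -> bytes:
    """Hypothetical cache key: normalize to NFC, then encode as UTF-8."""
    return unicodedata.normalize("NFC", value).encode("utf-8")


# Keyed on the bytes as given, the two spellings are different entries.
raw_keys_match = composed.encode("utf-8") == decomposed.encode("utf-8")

# Keyed on the normalized form, they collapse to one entry.
normalized_keys_match = cache_key(composed) == cache_key(decomposed)

print(raw_keys_match, normalized_keys_match)
```

Either behaviour could be defended; the sketch only shows that the two choices give different cache hits, and that the difference comes from Unicode normalization rather than from UTF-8 itself.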
Received on Monday, 11 February 2013 08:36:17 UTC