- From: Nico Williams <nico@cryptonector.com>
- Date: Sun, 10 Feb 2013 18:09:14 -0600
- To: Zhong Yu <zhong.j.yu@gmail.com>
- Cc: Poul-Henning Kamp <phk@phk.freebsd.dk>, Julian Reschke <julian.reschke@gmx.de>, "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, James M Snell <jasnell@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
On Sun, Feb 10, 2013 at 4:49 PM, Zhong Yu <zhong.j.yu@gmail.com> wrote:
> On Sun, Feb 10, 2013 at 4:24 AM, Poul-Henning Kamp <phk@phk.freebsd.dk> wrote:
>>>1) Filenames in Content-Disposition
>>
>> These only have meaning to the ultimate destinations, and if their
>> filesystems don't support UTF-8, they'll have to do $something anyway.

The filesystems pretty much do support either UTF-8 or "just-use-8". In general "just-use-8" only really interops if everyone uses the same codeset, and the only codeset we have that can be used universally is... Unicode.

>> Nobody in the HTTP/2 protocol-chain can do anything but treat this
>> as an opaque bytestring.
>
> But how does the 2 ends agree on which encoding to use? It might be
> easier if HTTP just dictate UTF-8.

Not might be. Will be. We've done this in many other protocols.

In general we must either tag text with codeset metadata or declare that Unicode (UTF-8, generally) SHALL be used in the middle, pushing codeset conversions to the edge. No character set other than Unicode is suitable for use "in the middle", and tagging strings with codeset metadata is particularly difficult.

It might be useful to go over what we've done in filesystems and remote/distributed filesystem protocols. Very briefly, in ZFS we implemented fast normalization-insensitive string comparison and hashing functionality; the filesystem has an option to reject any non-UTF-8 byte sequences, but otherwise never normalizes on CREATE (compare to HFS+, which does). Meanwhile NFSv4 calls for using only UTF-8 on the wire.

This works. It works *really* well. The code is even open source.
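
To make the three ideas above concrete (convert at the edge, optionally reject non-UTF-8, compare without normalizing what is stored), here is a minimal Python sketch. It is an illustration only, not the ZFS or NFSv4 code: the function names are hypothetical, and the choice of NFC as the comparison form is an assumption made for the example.

```python
import unicodedata


def to_wire(raw: bytes, local_codeset: str) -> bytes:
    """Edge conversion: decode from the local codeset, emit UTF-8 for the middle."""
    return raw.decode(local_codeset).encode("utf-8")


def is_valid_utf8(name: bytes) -> bool:
    """Optional strictness: report whether a byte sequence is well-formed UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def names_equal(a: bytes, b: bytes) -> bool:
    """Normalization-insensitive comparison of two UTF-8 names.

    Both names are normalized only for the comparison (NFC is an assumption
    here); the stored bytes are never rewritten.
    """
    na = unicodedata.normalize("NFC", a.decode("utf-8"))
    nb = unicodedata.normalize("NFC", b.decode("utf-8"))
    return na == nb


# "é" as one precomposed code point vs. "e" followed by a combining acute accent.
composed = "\u00e9".encode("utf-8")
decomposed = "e\u0301".encode("utf-8")

print(composed == decomposed)             # byte-for-byte comparison
print(names_equal(composed, decomposed))  # normalization-insensitive comparison
print(is_valid_utf8(b"\xe9"))             # a lone Latin-1 "é" byte
print(to_wire(b"\xe9", "latin-1") == composed)
```

The point of the design is that only the endpoints ever need to know a local codeset; everything between them sees UTF-8 and can compare names without rewriting them.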
Filesystems are a great example of an application where tagging strings with codeset metadata doesn't work: we'd need to push process setlocale information into the kernel, and tag strings all the way from the system call boundary, through the VFS, to the filesystem driver, with consequent impact on stable interfaces up and down the stack and massive code modification requirements. Filesystems are not the only example of this, but because filesystems cross so many layers in our stacks (user-land APIs, kernel-land APIs, on-the-wire protocols, on-disk formats) they are perhaps the best example.

UTF-8 in the middle.

Nico
Received on Monday, 11 February 2013 00:09:38 UTC