Re: Delta Compression and UTF-8 Header Values from Nico Williams on 2013-02-11 (ietf-http-wg@w3.org from January to March 2013)

From: Nico Williams <nico@cryptonector.com>
Date: Sun, 10 Feb 2013 18:09:14 -0600
To: Zhong Yu <zhong.j.yu@gmail.com>
Cc: Poul-Henning Kamp <phk@phk.freebsd.dk>, Julian Reschke <julian.reschke@gmx.de>, "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, James M Snell <jasnell@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Message-ID: <CAK3OfOi+cXMLGsMCpD1cRBxzz46wVYYj8nz021fhqhM7fTDMWA@mail.gmail.com>

On Sun, Feb 10, 2013 at 4:49 PM, Zhong Yu <zhong.j.yu@gmail.com> wrote:
> On Sun, Feb 10, 2013 at 4:24 AM, Poul-Henning Kamp <phk@phk.freebsd.dk> wrote:
>>>1) Filenames in Content-Disposition
>>
>> These only have meaning to the ultimate destinations, and if their
>> filesystems don't support UTF-8, they'll have to do $something anyway.

The filesystems pretty much do support either UTF-8 or "just-use-8".
In general "just-use-8" only really interops if everyone uses the same
codeset, and the only codeset we have that can be used universally
is... Unicode.

>> Nobody in the HTTP/2 protocol-chain can do anything but treat this
>> as an opaque bytestring.
>
> But how do the 2 ends agree on which encoding to use? It might be
> easier if HTTP just dictates UTF-8.

Not might be.  Will be.

We've done this in many other protocols.  In general we must either
tag text with codeset metadata or declare that Unicode (UTF-8,
generally) SHALL be used in the middle, pushing codeset conversions
to the edge.  No character set other than Unicode is suitable for use
"in the middle", and tagging strings with codeset metadata is
particularly difficult.
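
To make "UTF-8 in the middle, conversions at the edge" concrete, here
is a minimal Python sketch.  The helper names (to_wire, from_wire) and
the two example codesets are assumptions chosen for illustration; they
do not come from any HTTP or filesystem specification.

```python
def to_wire(local_bytes: bytes, local_codeset: str) -> bytes:
    """Sender's edge: convert from the local codeset to UTF-8."""
    return local_bytes.decode(local_codeset).encode("utf-8")


def from_wire(wire_bytes: bytes, local_codeset: str) -> bytes:
    """Receiver's edge: convert from UTF-8 to the local codeset."""
    return wire_bytes.decode("utf-8").encode(local_codeset)


# Hypothetical sender on an ISO-8859-15 system, receiver on a
# Windows-1252 system.  The euro sign has a different byte value in
# each codeset, so passing the raw bytes through would not interoperate.
name = "prix_\u20ac.txt"
sent_local = name.encode("iso8859_15")

wire = to_wire(sent_local, "iso8859_15")   # only UTF-8 crosses the middle
received_local = from_wire(wire, "cp1252")
```

Neither end needs to know the other's codeset, and nothing in the
middle needs to carry codeset metadata; each end only has to know its
own.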

It might be useful to go over what we've done in filesystems and
remote/distributed filesystem protocols.  Very briefly, in ZFS we
implemented fast normalization-insensitive string comparison and
hashing functionality; the filesystem has an option to reject any
non-UTF-8 byte sequences, but otherwise never normalizes on CREATE
(compare to HFS+).  Meanwhile NFSv4 calls for using only UTF-8 on the
wire.  This works.  It works *really* well.  The code is even open
source.  Filesystems are a great example of an application where
tagging strings with codeset metadata doesn't work: we'd need to push
process locale (setlocale) information into the kernel, and tag strings
all the way from the system call boundary, through the VFS, to the
filesystem driver -- with consequent impact on stable interfaces up and
down the stack, and massive code modification requirements.
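
The behaviour described above (reject non-UTF-8 names, compare and
hash without regard to normalization form, but store the name exactly
as given) can be sketched roughly as follows.  This is not the ZFS
code: the Directory class and norm_key helper are hypothetical, and
the sketch simply normalizes the whole string to NFC for the key
rather than using a fast comparison.

```python
import unicodedata
from typing import Dict, Optional


def norm_key(name: bytes) -> str:
    """Decode a UTF-8 name and return its NFC form, used only as a key.

    Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    """
    return unicodedata.normalize("NFC", name.decode("utf-8"))


class Directory:
    """Hypothetical directory: normalization-insensitive, form-preserving."""

    def __init__(self) -> None:
        # Maps normalized key -> name bytes exactly as given at create time.
        self._entries: Dict[str, bytes] = {}

    def create(self, name: bytes) -> None:
        try:
            key = norm_key(name)
        except UnicodeDecodeError:
            # Mirrors the "reject non-UTF-8 byte sequences" option.
            raise ValueError("name is not valid UTF-8") from None
        if key in self._entries:
            raise FileExistsError(name)
        self._entries[key] = name  # stored as-is, never normalized

    def lookup(self, name: bytes) -> Optional[bytes]:
        """Return the stored name that matches, or None."""
        try:
            return self._entries.get(norm_key(name))
        except UnicodeDecodeError:
            return None


d = Directory()
composed = "caf\u00e9".encode("utf-8")      # precomposed e-acute
decomposed = "cafe\u0301".encode("utf-8")   # e + combining acute accent
d.create(composed)
found = d.lookup(decomposed)  # matches; returns the bytes as created
```

The point of the design is that the two byte strings differ, yet
either one finds the same entry, and the creator's original bytes are
what comes back.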

Filesystems are not the only example of this, but because filesystems
cross so many layers in our stacks (user-land APIs, kernel-land APIs,
on-the-wire protocols, on-disk formats) they are perhaps the best
example.

UTF-8 in the middle.

Nico
--

Received on Monday, 11 February 2013 00:09:38 UTC