RE: UTF-8 in URIs

As long as the normalization form is NFC or NFD (i.e. not NFK*) then you don't need to say anything more,
since conversion between NFC and NFD is lossless.  The receiver can convert to its local form without additional
information.
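
For illustration, this is easy to check with Python's standard unicodedata module ("café" here is just a sample string):

    import unicodedata

    s_nfc = "caf\u00e9"                          # "café", precomposed U+00E9 (NFC)
    s_nfd = unicodedata.normalize("NFD", s_nfc)  # "cafe" + combining U+0301
    assert unicodedata.normalize("NFC", s_nfd) == s_nfc  # round-trip is lossless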

-Dave

From: Michael Sweet [mailto:msweet@apple.com]
Sent: Wednesday, January 15, 2014 12:07 PM
To: Gabriel Montenegro
Cc: ietf-http-wg@w3.org; Osama Mazahir; Dave Thaler; Mike Bishop; Matthew Cox
Subject: Re: UTF-8 in URIs

Gabriel,

Encoding is only one aspect of this; the normalization form varies between OSes, so you'd need to say something (at least) about the normalization form to use with UTF-8 and other Unicode encodings, probably as a server requirement to normalize to the local form used for filenames, etc.
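
For example, a minimal sketch of such a server-side step in Python (the function name and the per-filesystem NFC/NFD choice are illustrative assumptions):

    import unicodedata

    def to_local_form(decoded_path: str, local_form: str = "NFC") -> str:
        # Normalize a decoded URI path to the form the local filesystem
        # uses, e.g. "NFD" on HFS+-style systems, "NFC" nearly everywhere else.
        return unicodedata.normalize(local_form, decoded_path)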


On Jan 15, 2014, at 2:55 PM, Gabriel Montenegro <Gabriel.Montenegro@microsoft.com> wrote:


Hi folks,

Some of us (cc line) have been discussing the unfortunate lack of determinism with respect to URI encoding in HTTP/1.1 and would like HTTP/2.0 to improve upon the situation.

I just opened this issue: https://github.com/http2/http2-spec/issues/342 (enable determinism for URI encoding in HTTP/2.0).

The "http" and "https" URI schemes don't have a fixed encoding. The URI RFC (http://tools.ietf.org/html/rfc3986#section-2.5) talks about the generic syntax for URI components:

  *   Legacy URI components (before 2005) tend to use UTF-8 "or some other superset of the US-ASCII character encoding"
  *   New schemes (after 2005) have to use UTF-8 with percent encoding for reserved characters.
The first bullet explains why we currently have non-determinism for "http" and "https" URIs. This is particularly problematic when parsing URIs at the server side or at intermediate proxies (e.g., when looking for a cache hit).
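
To make the ambiguity concrete (in Python; "café" is just a sample path segment):

    from urllib.parse import quote

    # The same path segment "café" percent-encodes differently depending
    # on which charset the sender assumed:
    print(quote("café".encode("utf-8")))       # caf%C3%A9
    print(quote("café".encode("iso-8859-1")))  # caf%E9

A server or proxy that receives "caf%E9" has no reliable way to tell which charset produced it.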

Proposed goal: enable determinism for URI encoding in HTTP/2.0.

We propose either (1) a SETTING, "SETTINGS_URI_ENCODING", or (2) an ":encoding" header. We favor option (2). In either case, the value denoting the charset would be a 32-bit integer equal to the "MIBenum" value in the IANA registry (http://www.iana.org/assignments/character-sets/character-sets.xhtml); hence, the value for UTF-8 would be 106. The legacy, non-deterministic behavior is indicated by the value 0, which is a reserved MIBenum value.

Note: We could use the charset name instead, but there are actually two "name" columns in the IANA table (the "Name" column can hold multiple values, and the "Preferred MIME Name" column is not always present). We would also have to define a name to denote the legacy behavior.
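
For illustration, a receiver might map the value to a codec roughly like this (Python; the table and function names are hypothetical, not from any spec):

    # MIBenum 4 = ISO-8859-1, 106 = UTF-8, 0 = reserved (legacy).
    MIBENUM_TO_CODEC = {
        0: None,          # legacy: encoding unknown
        4: "iso-8859-1",
        106: "utf-8",
    }

    def decode_uri_bytes(raw: bytes, mibenum: int):
        codec = MIBENUM_TO_CODEC.get(mibenum)
        return raw.decode(codec) if codec else None  # None = make no assumption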

Some use cases:


  1.  A legacy client behind an HTTP/2.0-capable proxy, talking to an HTTP/2.0-capable server.

The client will use HTTP/1.1 to talk to the proxy.  Without special out-of-band knowledge, the proxy cannot know the encoding for sure, so it would have to disable the assumption when talking to the server by setting the value to 0 to denote legacy behavior.


  2.   An HTTP/2.0-capable client behind an HTTP/2.0-capable proxy, talking to a legacy server.

The client will use HTTP/2.0 to talk to the proxy.  The never-standardized (and now expired) 3987bis draft added text about using the encoding of the containing HTML document, e.g., iso-8859-1. If the server made that assumption, the proxy has no way to know what the encoding was.
Thus, to get correct behavior, either the client has to turn the SETTING off (indicating legacy behavior) when going via a proxy, gaining no benefit, or there must be a way for the client to convey via HTTP/2.0 the encoding of the containing document or protocol.  The SETTING allows the latter, in this case by using the value 4 for iso-8859-1. This information lets the proxy interpret the URI correctly, e.g., when looking for a cache hit, whereas today it would get a cache miss.
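
A minimal sketch of that proxy-side canonicalization (the function name and the choice to key the cache on the decoded string are assumptions for illustration):

    from urllib.parse import unquote_to_bytes

    def cache_key(raw_path: str, mibenum: int) -> str:
        # Hypothetical MIBenum-to-codec table: 4 = iso-8859-1, 106 = UTF-8.
        codec = {4: "iso-8859-1", 106: "utf-8"}.get(mibenum)
        if codec is None:
            return raw_path  # legacy (0 or unknown): compare bytes as-is
        # Decode percent-escapes with the declared charset so that
        # caf%E9 (iso-8859-1) and caf%C3%A9 (UTF-8) map to one entry.
        return unquote_to_bytes(raw_path).decode(codec)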

Some pros and cons of these two mechanisms:


  1.  A SETTINGS parameter would naturally allow specifying a general encoding like UTF-8 across all requests.

     *   Pro: This uses HTTP/2.0 as defined.
     *   Con: per-request changes are a bit hackish, requiring constant re-sending of this SETTING


  2.  An :encoding header would allow specifying the encoding on a per-request basis, but there would be no way to specify it in general.

     *   Pro: clean per-request scope for such use cases
     *   Pro: benefits from header compression
     *   Con: this implies sending it on every request per the current rules in HTTP/2.0, or adding another exception to those rules (similar to :authority).

Comments?

Thanks,

Gabriel

_________________________________________________________
Michael Sweet, Senior Printing System Engineer, PWG Chair

Received on Wednesday, 15 January 2014 20:15:26 UTC