UTF-8 in URIs

Hi folks,

Some of us (see the cc line) have been discussing the unfortunate lack of determinism with respect to URI encoding in HTTP/1.1 and would like HTTP/2.0 to improve upon the situation.

I just opened this issue: https://github.com/http2/http2-spec/issues/342 (enable determinism for URI encoding in HTTP/2.0).

The "http" and "https" URI schemes don't have a fixed encoding. The URI RFC (http://tools.ietf.org/html/rfc3986#section-2.5) talks about the generic syntax for URI components:

  *   Legacy URI components (before 2005) tend to use UTF-8 "or some other superset of the US-ASCII character encoding"
  *   New schemes (after 2005) should use UTF-8, percent-encoding any octets outside the unreserved set.
The first bullet explains why we currently have non-determinism for "http" and "https" URIs. This is particularly problematic when parsing URIs on the server side or at intermediate proxies (e.g., when looking for a cache hit).
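To make the ambiguity concrete, here is a minimal Python sketch (illustrative only, not from any spec) showing the same path percent-encoded under two interpretations that are both legal for "http" URIs today:

    from urllib.parse import quote

    path = "/caf\u00e9"  # "/café"

    utf8_form = quote(path, encoding="utf-8")         # "/caf%C3%A9"
    latin1_form = quote(path, encoding="iso-8859-1")  # "/caf%E9"

    # A cache keyed on the raw octets sees two distinct resources:
    print(utf8_form == latin1_form)  # False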

Proposed goal: enable determinism for URI encoding in HTTP/2.0.

We see two candidate mechanisms: (1) a SETTING, "SETTINGS_URI_ENCODING", or (2) an ":encoding" header. We favor option (2). In either case, the value denoting the charset would be a 32-bit integer equal to the "MIBenum" value in the IANA registry (http://www.iana.org/assignments/character-sets/character-sets.xhtml); hence, the value for UTF-8 would be 106. The legacy, non-deterministic behavior is indicated by the value 0, which is a reserved MIBenum value.

Note: we could use the charset name instead, but the IANA table actually has two "name" columns: the "Name" value can carry multiple aliases, and the "Preferred MIME Name" is not always present. We would also have to define a name to denote the legacy behavior.
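For illustration, here is a rough Python sketch of how a recipient might act on the proposed value; the mapping table and function name are hypothetical, and 0 stands for the legacy "encoding unknown" behavior:

    from urllib.parse import unquote

    MIBENUM_TO_CODEC = {
        4: "iso-8859-1",  # IANA MIBenum 4
        106: "utf-8",     # IANA MIBenum 106
    }

    def decode_path(raw_path, mibenum):
        codec = MIBENUM_TO_CODEC.get(mibenum)
        if codec is None:
            # 0 (legacy) or unrecognized: treat the path as opaque octets
            return None
        return unquote(raw_path, encoding=codec)

    print(decode_path("/caf%C3%A9", 106))  # /café
    print(decode_path("/caf%E9", 4))       # /café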

Some use cases:

1.  A legacy client behind an HTTP/2.0-capable proxy, talking to an HTTP/2.0-capable server.

The client will use HTTP/1.1 to talk to the proxy. Without special out-of-band knowledge, the proxy cannot know the encoding for sure, so it would have to disable the assumption when talking to the server by sending the value 0 to denote legacy behavior.
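The proxy's decision amounts to the following (a trivial sketch with hypothetical names):

    LEGACY = 0  # reserved MIBenum value: encoding unknown / legacy behavior

    def encoding_for_upstream(client_is_http2, client_mibenum):
        # An HTTP/1.1 client declares no encoding, so the proxy must not
        # guess; it advertises 0 upstream instead of assuming UTF-8.
        return client_mibenum if client_is_http2 else LEGACY

    print(encoding_for_upstream(False, 106))  # 0
    print(encoding_for_upstream(True, 106))   # 106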

2.  An HTTP/2.0-capable client behind an HTTP/2.0-capable proxy, talking to a legacy server.

The client will use HTTP/2.0 to talk to the proxy. The never-standardized (and now expired) 3987bis draft added text about using the encoding of the containing HTML document, e.g. iso-8859-1. If the server relies on that assumption, the proxy has no way to know what the encoding was.
Thus, to get correct behavior, either the client turns the SETTING off (indicating legacy behavior) whenever it goes via a proxy, gaining no benefit, or the client has a way to convey via HTTP/2.0 the encoding of the containing document or protocol. The proposal allows the latter, in this case by sending the value 4 for iso-8859-1. That information lets the proxy interpret the URI, e.g., when looking for a cache hit, whereas today it would get a cache miss.
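As a hedged sketch of the benefit: with a declared encoding, the proxy can normalize URIs to one canonical form (UTF-8 below) before the cache lookup. The function is illustrative, not part of the proposal:

    from urllib.parse import quote, unquote

    def canonical_cache_key(raw_path, codec):
        # Decode the percent-encoding using the declared charset, then
        # re-encode canonically as UTF-8.
        decoded = unquote(raw_path, encoding=codec)
        return quote(decoded, encoding="utf-8")

    # The legacy server's iso-8859-1 form and a modern UTF-8 form now
    # map to the same cache entry instead of missing:
    print(canonical_cache_key("/caf%E9", "iso-8859-1"))  # /caf%C3%A9
    print(canonical_cache_key("/caf%C3%A9", "utf-8"))    # /caf%C3%A9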

Some pros and cons of these two mechanisms:

1.  A SETTING would naturally allow specifying a general encoding like UTF-8 across all requests.

  *   Pro: this uses HTTP/2.0 as currently defined.
  *   Con: per-request changes are hackish, requiring the SETTING to be re-sent constantly.

2.  An ":encoding" header would allow specifying the encoding on a per-request basis, but there would be no way to specify it once for all requests (a hypothetical header block is sketched below).

  *   Pro: clean per-request scope for such use cases.
  *   Pro: benefits from header compression.
  *   Con: this implies sending it on every request under current HTTP/2.0 rules, or carving out another exception to those rules (similar to :authority).
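For concreteness, a request under option (2) might carry a header block like the following; this is purely illustrative, since neither the header name nor its value format is defined anywhere yet:

    :method: GET
    :scheme: http
    :authority: example.com
    :path: /caf%C3%A9
    :encoding: 106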

Comments?

Thanks,

Gabriel
