
Re: UTF-8 in URIs

From: Zhong Yu <zhong.j.yu@gmail.com>
Date: Wed, 15 Jan 2014 14:46:39 -0600
Message-ID: <CACuKZqF0oxcpJWYnDzzVSwzeJgQ4K18gZCynyYh0uJwY=4xHtA@mail.gmail.com>
To: Gabriel Montenegro <Gabriel.Montenegro@microsoft.com>
Cc: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>, Osama Mazahir <OSAMAM@microsoft.com>, Dave Thaler <dthaler@microsoft.com>, Mike Bishop <Michael.Bishop@microsoft.com>, Matthew Cox <macox@microsoft.com>
Can you give an example where an intermediary benefits from decoding
URI octets into Unicode characters?

On Wed, Jan 15, 2014 at 1:55 PM, Gabriel Montenegro
<Gabriel.Montenegro@microsoft.com> wrote:
> Hi folks,
>
>
>
> Some of us (cc line) have been discussing the unfortunate lack of
> determinism with respect to URI encoding in HTTP/1.1 and would like HTTP/2.0
> to improve upon the situation.
>
>
>
> I just opened this issue: https://github.com/http2/http2-spec/issues/342
> (enable determinism for URI encoding in HTTP/2.0).
>
>
>
> The “http” and “https” URI schemes don’t have a fixed encoding. The URI RFC
> (http://tools.ietf.org/html/rfc3986#section-2.5) talks about the generic
> syntax for URI components:
>
> o    Legacy URI components (before 2005) tend to use UTF-8 “or some other
> superset of the US-ASCII character encoding”
>
> o    New schemes (after 2005) have to use UTF-8 with percent encoding for
> reserved characters.
>
> The first bullet explains why we currently have non-determinism for “http”
> and “https” URIs. This is particularly problematic when parsing URIs at the
> server side or at intermediate proxies (e.g., when looking for a cache hit).
>
>
>
> Proposed goal: enable determinism for URI encoding in HTTP/2.0.
>
>
>
> Either (1)  a SETTING “SETTINGS_URI_ENCODING” or (2) an ":encoding" header.
> We favor option (2). In either case, the value to denote the charset would
> be a 32-bit integer equivalent to the “MIBenum” value in the IANA registry
> (http://www.iana.org/assignments/character-sets/character-sets.xhtml).
> Hence, the value would be 106 for UTF-8. The legacy behavior of
> non-determinism is indicated via the value 0. Notice that this is a reserved
> value for MIBenum.
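
A minimal sketch, not part of the proposal, of how a receiver might act on such a value. The table below is a hand-picked, hypothetical subset of the IANA registry mapped to Python codec names, and `decode_path` is an illustrative helper name, not anything defined by HTTP/2.0. The value 0 is treated as "encoding unknown", as the proposal describes.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical subset of the IANA MIBenum registry, mapped to Python codec names.
MIBENUM_TO_CODEC = {
    3: "ascii",        # US-ASCII
    4: "iso-8859-1",   # ISO_8859-1:1987
    106: "utf-8",      # UTF-8
}
LEGACY = 0  # the proposal's marker for the legacy, non-deterministic behavior


def decode_path(raw_path, mibenum):
    """Return the path as text, or None if the encoding is undeclared or unknown here."""
    if mibenum == LEGACY:
        return None
    codec = MIBENUM_TO_CODEC.get(mibenum)
    if codec is None:
        return None
    # Undo percent-encoding to get raw octets, then interpret them with the declared charset.
    # Raises UnicodeDecodeError if the octets are not valid in that charset.
    return unquote_to_bytes(raw_path).decode(codec)
```

With this sketch, `decode_path("/caf%C3%A9", 106)` and `decode_path("/caf%E9", 4)` both yield the same text, while a value of 0 leaves the octets uninterpreted.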
>
>
>
> Note: We could use the charset name, but there are actually two "name"
> columns in the IANA table (with the "name" value possibly having multiple
> values, and the "preferred MIME name" not always being present). We would
> also have to define a name to denote the legacy behavior.
>
>
>
> Some use cases:
>
>
>
> 1.     A legacy client behind an HTTP/2.0 capable proxy, talking to an
> HTTP/2.0 capable server.
>
>
>
> The client will use HTTP/1.1 to talk to the proxy.  Without special
> out-of-band knowledge, the proxy will not know the encoding for sure so
> would have to turn off the assumption when talking to the server by setting
> the value to 0 to denote legacy behavior.
>
>
>
> 2.     An HTTP/2.0 capable client behind an HTTP/2.0 capable proxy, talking
> to a legacy server.
>
>
>
> The client will use HTTP/2.0 to talk to the proxy.  The never-standardized
> (and now expired) 3987bis added text about using the encoding of the
> containing HTML, e.g. iso-8859-1. If the server had that assumption, then
> the proxy has no way to know what the encoding was.
>
> Thus, to get correct behavior, either the client has to turn the SETTING off
> (indicating legacy behavior) when going via a proxy and get no benefit, or
> else the client needs a way to pass via HTTP/2.0 the encoding of
> the containing document or protocol.  The SETTING allows the latter, in this
> case by using the value 4 for iso-8859-1. This information lets the proxy
> interpret the URI, e.g., for purposes of looking for a cache hit,
> whereas today it would get a cache miss.
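
To make the cache-hit scenario concrete, here is a rough sketch of how a proxy could derive one cache key from two differently encoded requests for the same resource. `cache_key` is a hypothetical helper, the codec names are Python's, and re-encoding to UTF-8 is an assumption about how a proxy might normalize; the proposal itself does not specify this step.

```python
from urllib.parse import quote, unquote_to_bytes


def cache_key(raw_path, codec):
    """Normalize a percent-encoded path to UTF-8 percent-escapes (illustrative only)."""
    # Percent-escapes -> raw octets -> text, using the charset the client declared.
    text = unquote_to_bytes(raw_path).decode(codec)
    # Text -> UTF-8 octets -> percent-escapes, so equivalent paths compare equal.
    # Caveat: this also unescapes reserved characters such as %2F, which a
    # real normalizer would have to leave untouched.
    return quote(text, safe="/")


# The same resource, requested with ISO-8859-1 and with UTF-8 octets.
latin1_key = cache_key("/caf%E9", "iso-8859-1")
utf8_key = cache_key("/caf%C3%A9", "utf-8")
print(latin1_key == utf8_key)
```

Without the declared encoding, the proxy can only compare the raw octets `%E9` and `%C3%A9`, which differ.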
>
>
>
> Some pros and cons of these two mechanisms:
>
>
>
> 1.     SETTINGS would naturally allow specifying a general encoding like
> UTF-8 across all requests
>
> o    Pro: This uses HTTP/2.0 as defined.
>
> o    Con: per-request changes are a bit hackish, requiring constantly
> sending this SETTING
>
>
>
> 2.     An :encoding header would allow specifying the encoding on a
> per-request basis, but there would be no way to specify it in general.
>
> o    Pro: clean per-request scope for such use-cases
>
> o    Pro: benefits from header compression
>
> o    Con: This implies sending it on every request per current rules in
> HTTP/2.0, or another exception to those rules (similar to :authority).
>
>
>
> Comments?
>
>
>
> Thanks,
>
>
>
> Gabriel
Received on Wednesday, 15 January 2014 20:47:06 UTC
