Re: UTF-8 in URIs

Hello Gabriel,

Responding to your original post after having read through the thread.

On 2014/01/16 4:55, Gabriel Montenegro wrote:
> Hi folks,
>
> Some of us (cc line) have been discussing the unfortunate lack of determinism with respect to URI encoding in HTTP/1.1 and would like HTTP/2.0 to improve upon the situation.
>
> I just opened this issue: https://github.com/http2/http2-spec/issues/342 (enable determinism for URI encoding in HTTP/2.0).
>
> The "http" and "https" URI schemes don't have a fixed encoding. The URI RFC (http://tools.ietf.org/html/rfc3986#section-2.5) talks about the generic syntax for URI components:
>
>    *   Legacy URI components (before 2005) tend to use UTF-8 "or some other superset of the US-ASCII character encoding"
>    *   New schemes (after 2005) have to use UTF-8 with percent encoding for reserved characters.

Please note that a lot of schemes, including pre-2005 ones, indeed do 
use UTF-8 with percent-encoding, at least if implemented per spec. For 
examples, see http://tools.ietf.org/search/rfc3987#section-1.2, point c).

On the other hand, there are schemes that are clearly post-2005 but 
seem to work exactly like http/https. As far as I understand, ws/wss 
work that way (using the same codepaths in browsers as http/https), 
although I haven't tested it and would of course be pleasantly 
surprised if I were wrong.

> The first bullet explains why we currently have non-determinism for "http" and "https" URIs. This is particularly problematic when parsing URIs at the server side or at intermediate proxies (e.g., when looking for a cache hit).
>
> Proposed goal: enable determinism for URI encoding in HTTP/2.0.
>
> Either (1)  a SETTING "SETTINGS_URI_ENCODING" or (2) an ":encoding" header. We favor option (2). In either case, the value to denote the charset would be a 32-bit integer equivalent to the "MIBenum" value in the IANA registry (http://www.iana.org/assignments/character-sets/character-sets.xhtml). Hence, the value would be 106 for UTF-8. The legacy behavior of non-determinism is indicated via the value 0. Notice that this is a reserved value for MIBenum.

This might work in theory, but it would be a very late fix to a 
long-standing problem, and it would either take hold while pushing 
things in the wrong direction, or fail anyway.

I think a less ambiguous mechanism could do more good and push the Web 
more in the right direction: just have two values, one for UTF-8 and 
one for "unknown". Actually, because UTF-8 is easily detected 
heuristically, we are already close to that.
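
To make that concrete, here is a minimal sketch of the heuristic 
(Python, illustrative only): percent-decode the URI octets and check 
whether they form valid UTF-8. Non-ASCII byte sequences produced by 
legacy encodings almost never pass this check by accident.

    from urllib.parse import unquote_to_bytes

    def looks_like_utf8(uri: str) -> bool:
        raw = unquote_to_bytes(uri)   # undo %XX escapes -> raw octets
        try:
            raw.decode('utf-8')       # strict decode rejects bad sequences
            return True
        except UnicodeDecodeError:
            return False

    # '%C3%A9' is 'é' in UTF-8; '%E9' is 'é' in iso-8859-1 and is
    # not valid UTF-8.
    assert looks_like_utf8('/caf%C3%A9')
    assert not looks_like_utf8('/caf%E9')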

> Note: We could use the charset name, but there are actually two "name" columns in the IANA table (with the "name" value possibly having multiple values, and the "preferred MIME name" not always being present). We would also have to define a name to denote the legacy behavior.

I agree that the number is easier to handle than the name. It also 
fits better with the binary nature of HTTP/2.0. I don't think we need 
a 32-bit value; 16 bits would be enough, in case we want to save bits. 
There are a few corner cases that would need careful examination. As 
an example, on the Web the label 'iso-8859-1' in practice often means 
'windows-1252', so we would need to specify what to do in such cases.
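
A quick illustration of that corner case (Python, illustrative only): 
byte 0x93 is a C1 control character under a strict iso-8859-1 reading, 
but a curly quotation mark in windows-1252, which is how browsers 
actually treat content labelled iso-8859-1.

    data = b'\x93quoted\x94'
    print(data.decode('iso-8859-1'))    # C1 controls U+0093/U+0094 (strict)
    print(data.decode('windows-1252'))  # "quoted" in curly quotes (browsers)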

> Some use cases:
>
> 1.     A legacy client behind an HTTP/2.0 capable proxy, talking to an HTTP/2.0 capable server.
>
> The client will use HTTP/1.1 to talk to the proxy.  Without special out-of-band knowledge, the proxy will not know the encoding for sure, so it would have to turn off the assumption when talking to the server by setting the value to 0 to denote legacy behavior.
>
> 2.      An HTTP/2.0 capable client behind an HTTP/2.0 capable proxy, talking to a legacy server.
>
> The client will use HTTP/2.0 to talk to the proxy.  The never-standardized (and now expired) 3987bis added text about using the encoding of the containing HTML, e.g. iso-8859-1.

That was only for the query part, and it was added because that's what 
browsers do in practice for the query part. The path part is still 
(supposed to be) UTF-8, because again that's what browsers do in 
practice.

So you can have two encodings in the same URI. How would your proposal 
handle such cases?
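
For example (Python sketch, illustrative values only), the path below 
is percent-encoded as UTF-8 while the query uses the page's legacy 
encoding, iso-8859-1:

    from urllib.parse import quote

    path  = '/caf' + quote('é', encoding='utf-8')      # -> /caf%C3%A9
    query = 'q=' + quote('é', encoding='iso-8859-1')   # -> q=%E9
    print('http://example.com' + path + '?' + query)
    # http://example.com/caf%C3%A9?q=%E9 -- one URI, two encodings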

> If the server had that assumption, then the proxy has no way to know 
> what the encoding was.
> Thus to get correct behavior, either the client has to turn the SETTING off (indicating legacy behavior) when going via a proxy and get no benefit, or else there has to be a way for the client to pass via HTTP/2.0 what the encoding was of the containing document or protocol.  The SETTING allows the latter, in this case by using the value 4 for iso-8859-1. This information allows the proxy to interpret the URI correctly, e.g., for purposes of looking for a cache hit, whereas today it would get a cache miss.
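
As a sketch of that interpretation step (Python; cache_key is a 
hypothetical helper, not part of the proposal): a proxy could decode 
the percent-encoded octets using the declared charset and re-encode 
them canonically as UTF-8, so that equivalent URIs map to the same 
cache key.

    from urllib.parse import quote, unquote_to_bytes

    def cache_key(uri: str, charset: str) -> str:
        text = unquote_to_bytes(uri).decode(charset)
        return quote(text, safe='/?=&', encoding='utf-8')

    print(cache_key('/caf%E9', 'iso-8859-1'))   # -> /caf%C3%A9
    print(cache_key('/caf%C3%A9', 'utf-8'))     # -> /caf%C3%A9 (same key)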
>
> Some pros and cons of these two mechanisms:
>
> 1.     SETTINGS would naturally allow specifying a general encoding like UTF-8 across all requests
> o    Pro: This uses HTTP/2.0 as defined.
> o    Cons: per-request changes are a bit hackish, requiring constantly sending this SETTING
>
> 2.     An ":encoding" header would allow specifying the encoding on a per-request basis, but there would be no way to specify it in general.
> o    Pro: clean per-request scope for such use-cases
> o    Pro: benefits from header compression
> o    Cons: This implies sending it on every request per current rules in HTTP/2.0, or another exception to those rules (similar to :authority).
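
For concreteness, a sketch of how the two options might look (Python 
notation, illustrative only; the setting name and pseudo-header are 
taken from the proposal above, with MIBenum 106 = UTF-8 and 0 = legacy):

    # Option 1: connection-wide setting, applies to every request
    SETTINGS_URI_ENCODING = 106        # MIBenum 106 = UTF-8; 0 = legacy

    # Option 2: hypothetical per-request pseudo-header
    request_headers = [
        (':method',   'GET'),
        (':path',     '/caf%C3%A9'),
        (':encoding', '106'),          # charset of the percent-encoded URI
    ]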
>
> Comments?
>
> Thanks,
>
> Gabriel
>

Regards,    Martin.

Received on Friday, 17 January 2014 08:00:40 UTC