RE: *** GMX Spamverdacht *** Re: UTF-8 in URIs from Gabriel Montenegro on 2014-01-16 (ietf-http-wg@w3.org from January to March 2014)

From: Gabriel Montenegro <Gabriel.Montenegro@microsoft.com>
Date: Thu, 16 Jan 2014 18:48:47 +0000
To: Julian Reschke <julian.reschke@gmx.de>, Nicolas Mailhot <nicolas.mailhot@laposte.net>
CC: Zhong Yu <zhong.j.yu@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>, Osama Mazahir <OSAMAM@microsoft.com>, Dave Thaler <dthaler@microsoft.com>, Mike Bishop <Michael.Bishop@microsoft.com>, "Matthew Cox" <macox@microsoft.com>
Message-ID: <b5c632977d47422fac1e92a99133226d@BN1PR03MB072.namprd03.prod.outlook.com>

To clarify: The proposal is *NOT* to impose a default of UTF-8 in URIs for HTTP/2.0. As some have mentioned, there are too many legacy issues.

The issue is that since http and https are legacy schemes from the point of view of RFC 3986, they don't have a fixed encoding. If somebody were to define a new scheme, say "http2", it would benefit from the RFC 3986 rules, so the encoding would be known: UTF-8 with percent-encoding.

Unfortunately, for http and https, URI handling at either the proxy (to check for a cache hit) or at the origin server is non-deterministic. Several encodings are tried until one works. Such non-determinism is also a potential security issue, as a URI could decode in more than one way as several encodings are tried.
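
As a rough illustration of the guessing problem, here is a minimal Python sketch. The function name candidate_decodings and the list of candidate encodings are assumptions made for this example only; they do not describe what any particular proxy or origin server actually tries.

```python
from urllib.parse import unquote_to_bytes


def candidate_decodings(path, encodings=("utf-8", "cp1252", "iso-8859-1")):
    """Return every text reading of a percent-encoded path, keyed by encoding.

    Hypothetical helper: the candidate list is an assumption, not taken
    from any real server implementation.
    """
    raw = unquote_to_bytes(path)  # undo percent-encoding, keep raw octets
    readings = {}
    for enc in encodings:
        try:
            readings[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            # This encoding cannot represent the octets; a guessing
            # receiver would move on to the next candidate.
            pass
    return readings


if __name__ == "__main__":
    for path in ("/caf%C3%A9", "/caf%E9"):
        print(path, candidate_decodings(path))
```

The octets behind "/caf%C3%A9" decode without error under all three candidates but yield two different strings, so which reading a receiver ends up with depends on the order in which it guesses.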

However, iff an HTTP/2.0 client knows the encoding for sure (e.g., UTF-8), per the proposal it could indicate so, and the receiving side would have no guessing games: in the presence of such an explicit indication, the URI is either valid UTF-8 or it is an error, and no further processing is done.
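
A minimal Python sketch of the receiving side under such an indication follows. How the indication is carried on the wire is not specified here, so the sketch simply assumes the caller already knows it is present; the names decode_declared_utf8 and UriEncodingError are hypothetical.

```python
from urllib.parse import unquote_to_bytes


class UriEncodingError(ValueError):
    """Raised when a URI declared as UTF-8 does not decode as UTF-8."""


def decode_declared_utf8(path):
    """Decode a percent-encoded path that the client declared to be UTF-8.

    Assumption: this is only called when the explicit indication is
    present. There is no fallback to other encodings.
    """
    raw = unquote_to_bytes(path)  # undo percent-encoding, keep raw octets
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # Not valid UTF-8: report an error and stop processing.
        raise UriEncodingError(f"path is not valid UTF-8: {path!r}") from exc
```

With this shape, "/caf%C3%A9" decodes to a single well-defined string, while "/caf%E9" is rejected outright instead of being retried under another encoding.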

-----Original Message-----
From: Julian Reschke [mailto:julian.reschke@gmx.de] 
Sent: Thursday, January 16, 2014 7:12
To: Nicolas Mailhot
Cc: Zhong Yu; Gabriel Montenegro; ietf-http-wg@w3.org; Osama Mazahir; Dave Thaler; Mike Bishop; Matthew Cox
Subject: Re: *** GMX Spamverdacht *** Re: UTF-8 in URIs

On 2014-01-16 15:57, Nicolas Mailhot wrote:
> ...
>>> And it's useless if you can't interpret it reliably. May as well log 
>>> the output of /dev/random at the time. Don't have time to get humans 
>>> comb millions of log lines to fix encoding errors.
>>
>> Define "encoding error" in the context of a URI.
>
> Any URI that can not be reliably decoded in the textual representation 
> the URL creator preferred by a random http processor (web site, 
> intermediary, web client) without outside help.

A valid URI is all US-ASCII. There's nothing that needs to be decoded at all.

> And there *is* a preferred textual representation because you know, 
> people do not enter URLs in binary editors.
>
>>>>> I favour making URLs UTF-8 by default in HTTP/2 (just as it was in 
>>>>> XML, that's one part of the XML spec that worked very well) and 
>>>>> require http/1 to 2 bridges to translate to the canonical form. Helping
>>>>> clients push local 8bits encodings will just perpetuate pre-2000 legacy mess.
>>>>
>>>> How do you translate a URI with unknown URI encoding to UTF-8?
>>>
>>> You treat it as UTF-8. If it fails UTF-8 sanity rules you fail with 
>>> an error. That will make people fix their encodings quickly.
>>
>> This is not going to work:
>>
>> a) People may have chosen a non-UTF8 encoding by accident (system locale
>> etc) and can't change it retroactively,
>
> They can add an UTF-8 translator gateway at http/2 adoption time. No 
> different and much easier than the mass of documents that needed to be 
> fixed once people started exchanging them in binary not dead wood form.

And that translator rewrites all URIs that appear in payloads?

> Some past mistakes need correction you can't grandfather them 
> eternally at the cost of eternal future interop problems.

The only interop problem I'm aware of is when clients construct *new* URIs, such as is common in WebDAV. The way to fix this is to *advocate* UTF-8. But even if everybody agrees on UTF-8 there's still the NFC/NFD mismatch between OSX and the rest of the world.
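
To illustrate the NFC/NFD point, here is a minimal Python sketch showing that the same visible name percent-encodes to two different URIs depending on the normalization form, even when both sides use UTF-8. The function name percent_encoded_forms is made up for this example.

```python
import unicodedata
from urllib.parse import quote


def percent_encoded_forms(name):
    """Return the UTF-8 percent-encoded NFC and NFD forms of a name.

    Hypothetical helper for illustration only.
    """
    nfc = unicodedata.normalize("NFC", name)  # precomposed characters
    nfd = unicodedata.normalize("NFD", name)  # base letter + combining marks
    return quote(nfc), quote(nfd)             # quote() encodes as UTF-8


if __name__ == "__main__":
    nfc, nfd = percent_encoded_forms("caf\u00e9")
    print(nfc)
    print(nfd)
    print(nfc == nfd)
```

A byte-wise comparison of the two results fails, so a resource created under one form is not found under the other unless one side normalizes.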

>> b) There might be actual *binary* data in the URI.
>
> So just define the canonical binary-to-utf8 mapping. If you don't your 
> URL will crash as soon as it needs to be displayed in an address bar, 
> network console or activity log. Again, much easier to define a single
> binary-to-utf8 mapping than random encoding to random display encoding 
> rules (hint: it is not possible and that's the core problem).

I still have no clue what problem you are trying to solve. Sorry.

> ...
>> Hm, no. They just happen to work in a way different from your 
>> preference, but they do just work fine.
>
> No, they don't work. Working is not "avoid any automated processing 
> and for christsakes never use anything but ASCII because the state is 
> undefined and things will randomly break"

The state is fully defined. It's just that you don't like that state.

> ...

It seems we aren't getting anywhere.

Can somebody else help me understand what this is all about? :-)

Best regards, Julian

Received on Thursday, 16 January 2014 18:49:17 UTC