- From: Gabriel Montenegro <Gabriel.Montenegro@microsoft.com>
- Date: Thu, 16 Jan 2014 18:48:47 +0000
- To: Julian Reschke <julian.reschke@gmx.de>, Nicolas Mailhot <nicolas.mailhot@laposte.net>
- CC: Zhong Yu <zhong.j.yu@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>, Osama Mazahir <OSAMAM@microsoft.com>, Dave Thaler <dthaler@microsoft.com>, Mike Bishop <Michael.Bishop@microsoft.com>, "Matthew Cox" <macox@microsoft.com>
To clarify: The proposal is *NOT* to impose a default of UTF-8 in URIs for HTTP/2.0. As some have mentioned, there are too many legacy issues.

The issue is that since http and https are legacy schemes from the point of view of rfc3986, they don't have a fixed encoding. If somebody were to define a new scheme, say, "http2", it would benefit from rfc3986 rules, so the encoding would be known: UTF-8 with percent encoding. Unfortunately, for http and http2, URI handling at either the proxy (to check for a cache hit) or at the origin server is non-deterministic. Several encodings are tried until one works. Such non-determinism is also a potential security issue, as a URI could decode in more than one way as several encodings are tried.

However, iff an HTTP/2.0 client knows for sure the encoding (e.g., UTF-8), per the proposal it could indicate so, and at the receiving side there are no guessing games: in the presence of such an explicit indication, either it is valid UTF-8, or it is an error, and no further processing is done.

-----Original Message-----
From: Julian Reschke [mailto:julian.reschke@gmx.de]
Sent: Thursday, January 16, 2014 7:12
To: Nicolas Mailhot
Cc: Zhong Yu; Gabriel Montenegro; ietf-http-wg@w3.org; Osama Mazahir; Dave Thaler; Mike Bishop; Matthew Cox
Subject: Re: *** GMX Spamverdacht *** Re: UTF-8 in URIs

On 2014-01-16 15:57, Nicolas Mailhot wrote:
> ...
>>> And it's useless if you can't interpret it reliably. May as well log
>>> the output of /dev/random at the time. Don't have time to get humans
>>> comb millions of log lines to fix encoding errors.
>>
>> Define "encoding error" in the context of a URI.
>
> Any URI that can not be reliably decoded in the textual representation
> the URL creator preferred by a random http processor (web site,
> intermediary, web client) without outside help.

A valid URI is all US-ASCII. There's nothing that needs to be decoded at all.

> And there *is* a preferred textual representation because you know,
> people do not enter URLs in binary editors.
>
>>>>> I favour making URLs UTF-8 by default in HTTP/2 (just as it was in
>>>>> XML, that's one part of the XML spec that worked very well) and
>>>>> require http/1 to 2 bridges to translate to the canonical form.
>>>>> Helping clients push local 8bits encodings will just perpetuate
>>>>> pre-2000 legacy mess.
>>>>
>>>> How do you translate a URI with unknown URI encoding to UTF-8?
>>>
>>> You treat it as UTF-8. If it fails UTF-8 sanity rules you fail with
>>> an error. That will make people fix their encodings quickly.
>>
>> This is not going to work:
>>
>> a) People may have chosen a non-UTF8 encoding by accident (system
>> locale etc) and can't change it retroactively,
>
> They can add an UTF-8 translator gateway at http/2 adoption time. No
> different and much easier than the mass of documents that needed to be
> fixed once people started exchanging them in binary not dead wood form.

And that translator rewrites all URIs that appear in payloads?

> Some past mistakes need correction you can't grandfather them
> eternally at the cost of eternal future interop problems.

The only interop problem I'm aware of is when clients construct *new* URIs, such as is common in WebDAV. The way to fix this is to *advocate* UTF-8. But even if everybody agrees on UTF-8 there's still the NFC/NFD mismatch between OSX and the rest of the world.

>> b) There might be actual *binary* data in the URI.
>
> So just define the canonical binary-to-utf8 mapping. If you don't your
> URL will crash as soon as it needs to be displayed in an address bar,
> network console or activity log. Again, much easier to define a single
> binary-to-utf8 mapping than random encoding to random display encoding
> rules (hint: it is not possible and that's the core problem).

I still have no clue what problem you are trying to solve. Sorry.

> ...
>> Hm, no. They just happen to work in a way different from your
>> preference, but they do just work fine.
>
> No, they don't work. Working is not "avoid any automated processing
> and for christsakes never use anything but ASCII because the state is
> undefined and things will randomly break"

The state is fully defined. It's just that you don't like that state.

> ...

It seems we aren't getting anywhere. Can somebody else help me understand what this is all about? :-)

Best regards, Julian
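
To make the contrast in the proposal at the top of this message concrete, here is a rough Python sketch of the two receiver behaviours: trying several encodings until one works, versus strict decoding when the sender has explicitly indicated UTF-8. The function names and the candidate encoding list are hypothetical, chosen only for illustration; the thread does not say which encodings real proxies or servers try, nor how the indication would be carried in HTTP/2.0.

```python
from urllib.parse import unquote_to_bytes

# Assumption: an illustrative candidate list, not taken from any real server.
CANDIDATE_ENCODINGS = ("utf-8", "windows-1252", "iso-8859-1")


def decode_by_guessing(raw_path: str) -> dict:
    """Collect every candidate encoding that accepts the percent-decoded octets."""
    octets = unquote_to_bytes(raw_path)
    results = {}
    for encoding in CANDIDATE_ENCODINGS:
        try:
            results[encoding] = octets.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate rejects the octets; try the next one
    return results


def decode_declared_utf8(raw_path: str) -> str:
    """Sender indicated UTF-8: decode strictly, raising UnicodeDecodeError otherwise."""
    octets = unquote_to_bytes(raw_path)
    return octets.decode("utf-8", errors="strict")


if __name__ == "__main__":
    # The same octets decode "successfully" under more than one encoding.
    print(decode_by_guessing("/caf%C3%A9"))
    # With an explicit UTF-8 indication there is exactly one outcome.
    print(decode_declared_utf8("/caf%C3%A9"))
    try:
        decode_declared_utf8("/caf%E9")  # a lone 0xE9 is not valid UTF-8
    except UnicodeDecodeError as exc:
        print("rejected:", exc.reason)
```

The guessing path returns more than one plausible string for the same octets, which is the ambiguity described above as a potential security issue; the strict path has only two outcomes, a decoded string or an error.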
Received on Thursday, 16 January 2014 18:49:17 UTC