- From: Nicolas Mailhot <nicolas.mailhot@laposte.net>
- Date: Thu, 16 Jan 2014 15:33:30 +0100
- To: "Zhong Yu" <zhong.j.yu@gmail.com>
- Cc: "Nicolas Mailhot" <nicolas.mailhot@laposte.net>, "Julian Reschke" <julian.reschke@gmx.de>, "Gabriel Montenegro" <gabriel.montenegro@microsoft.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>, "Osama Mazahir" <osamam@microsoft.com>, "Dave Thaler" <dthaler@microsoft.com>, "Mike Bishop" <michael.bishop@microsoft.com>, "Matthew Cox" <macox@microsoft.com>
On Thu, 16 January 2014 12:25, Zhong Yu wrote:
> There is no way to enforce UTF-8 on URIs; we cannot even enforce
> %-encoding, the server can always build proprietary encoding on top of
> ASCII chars (for its own convenience, not for being cryptic to others)
>
> URIs have never been supposed to be understandable by anyone other
> than the original server. I don't see how we can change that, unless
> we turn URI into a full-blown language with structures, semantics, and
> a huge vocabulary.

Look, that is all nonsense.

URLs are treated as text in HTML documents. URLs are treated as text in logs and traffic consoles. URLs are treated as text by web site designers (otherwise all accesses would be of the form mywebsite.com/opaquenumber, and how many sites actually do that?).

Web traffic is not direct end-to-end: it goes through intermediaries that need to decode part of the HTTP envelope. Besides, web sites are more and more interpenetrated (URL soup, aka mashups and clouds), so decoding has not been a private web-site affair for a long time.

All those elements manipulate not chains of bytes but text, and the difference between chains of bytes and text is clear encoding rules (I know that is a huge conceptual leap for most developers who haven't had to deal extensively with the fallout of encoding problems).

There is a difference between semantics (which are the business of web sites) and technical encoding. I don't care a fig what encoding a web server uses on its filesystem, or what encoding its web pages use. What I want is for the on-wire representation, which needs to be decoded by all kinds of third parties for things to work smoothly, to be clearly defined, without the usual "chain of bytes" cop-out. Software writers that love exotic encodings just have to translate their URLs to a UTF-8 representation before HTTP emission. Their choice. Interoperability problems should be pushed to end nodes, not embedded in the protocol itself.
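As a small aside (not part of the original mail), the interop hazard being described can be sketched in a few lines of Python: the same text segment percent-encoded from two different source encodings produces different on-wire bytes, and a third party that guesses the wrong rule garbles the text. The path segment "café" is a hypothetical example chosen for illustration.

```python
# Sketch of the encoding ambiguity: percent-encoding is defined over
# bytes, so the bytes on the wire depend on which character encoding
# the server happened to use before %-escaping.
from urllib.parse import quote, unquote

segment = "café"

as_utf8 = quote(segment.encode("utf-8"))      # 'é' becomes %C3%A9
as_latin1 = quote(segment.encode("latin-1"))  # 'é' becomes %E9

print(as_utf8)    # caf%C3%A9
print(as_latin1)  # caf%E9

# An intermediary seeing only the bytes cannot tell which rule was
# used; decoding with the wrong assumption loses the original text.
print(unquote(as_latin1, encoding="utf-8", errors="replace"))  # caf�
```

Mandating one on-wire representation (UTF-8, as argued above) removes the guesswork: every decoder applies the same rule.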
I'd rather have clear interop rules and simple-to-understand breakage at http2 adoption time than diffuse heisenbugs and sites that work by sheer chance as long as you chain components with the same unwritten assumptions. See the Python 2 unicode debacle and how much fuzzy encoding rules cost them.

-- 
Nicolas Mailhot
Received on Thursday, 16 January 2014 14:34:06 UTC