Re: UTF-8 in URIs from Nicolas Mailhot on 2014-01-16 (ietf-http-wg@w3.org from January to March 2014)

From: Nicolas Mailhot <nicolas.mailhot@laposte.net>
Date: Thu, 16 Jan 2014 15:33:30 +0100
To: "Zhong Yu" <zhong.j.yu@gmail.com>
Cc: "Nicolas Mailhot" <nicolas.mailhot@laposte.net>, "Julian Reschke" <julian.reschke@gmx.de>, "Gabriel Montenegro" <gabriel.montenegro@microsoft.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>, "Osama Mazahir" <osamam@microsoft.com>, "Dave Thaler" <dthaler@microsoft.com>, "Mike Bishop" <michael.bishop@microsoft.com>, "Matthew Cox" <macox@microsoft.com>
Message-ID: <aba0c024677d0d0bf4dc944ef64da6b7.squirrel@arekh.dyndns.org>

On Thu, 16 January 2014 12:25, Zhong Yu wrote:
> There is no way to enforce UTF-8 on URIs; we cannot even enforce
> %-encoding, the server can always build proprietary encoding on top of
> ASCII chars (for its own convenience, not for being cryptic to others)
>
> URIs have never been supposed to be understandable by anyone other
> than the original server. I don't see how we can change that, unless
> we turn URI into a full-blown language with structures, semantics, and
> a huge vocabulary.

Look, that is all nonsense.

URLs are treated as text in HTML documents. URLs are treated as text in
logs and traffic consoles. URLs are treated as text by web site designers
(otherwise all accesses would be in the form mywebsite.com/opaquenumber,
and how many sites actually do that?). Web traffic is not direct
end-to-end: it goes through intermediaries that need to decode part of the
HTTP envelope. Besides, web sites are more and more interpenetrated
(URL soup, aka mashups and clouds), so decoding has not been a private
web site affair for a long time.

All those elements do not manipulate chains of bytes but text, and the
difference between chains of bytes and text is clear encoding rules (I
know it is a huge leap in understanding for most developers who have not
had to deal extensively with the fallout of encoding problems).
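
A minimal Python sketch of that distinction, using only the standard
library. The path "/caf%E9" is a made-up example, not taken from any real
site: the same on-wire bytes turn into different text depending on which
encoding the reader assumes.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical percent-encoded path as it might appear on the wire.
wire_path = "/caf%E9"

# Percent-decoding only yields bytes; it says nothing about what text they are.
raw = unquote_to_bytes(wire_path)

# Three readers with three unwritten assumptions see three different strings.
as_latin1 = raw.decode("latin-1")                # treats 0xE9 as U+00E9
as_cp1251 = raw.decode("cp1251")                 # treats 0xE9 as U+0439
as_utf8 = raw.decode("utf-8", errors="replace")  # 0xE9 alone is invalid UTF-8

print(repr(raw), repr(as_latin1), repr(as_cp1251), repr(as_utf8))
```

Without a stated encoding rule, none of the three readings can be called
wrong, which is exactly the interoperability problem.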

There is a difference between semantics (which are the business of web
sites) and technical encoding. I don't care a fig about what encoding a
web server uses on its filesystem or the encoding of web pages. What I
want is for the on-wire representation, which needs to be decoded by all
kinds of third parties for things to work smoothly, to be clearly defined
without the usual "chain of bytes" cop-out.

Software writers who love exotic encodings just have to translate URLs to
their UTF-8 representation before HTTP emission. Their choice.
Interoperability problems should be pushed to end-nodes, not embedded in
the protocol itself.
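
As an illustration only, here is a rough Python sketch of what such a
translation step at an end-node could look like. The function name
to_utf8_path and the sample paths are hypothetical, and the approach is
deliberately naive: it assumes the sender knows its own legacy encoding,
handles the path component only, and re-escapes every character outside
the unreserved set, which may normalize some literals more aggressively
than a real implementation would want.

```python
from urllib.parse import quote, unquote_to_bytes


def to_utf8_path(path: str, legacy_encoding: str) -> str:
    """Re-express a percent-encoded path in UTF-8 (hypothetical helper)."""
    converted = []
    # Work segment by segment so an escaped slash (%2F) stays escaped.
    for segment in path.split("/"):
        raw = unquote_to_bytes(segment)      # percent-escapes -> raw bytes
        text = raw.decode(legacy_encoding)   # raw bytes -> text (sender's rule)
        converted.append(quote(text, safe="", encoding="utf-8"))
    return "/".join(converted)


# A Latin-1 "cafe with acute accent" becomes its UTF-8 percent-encoded form.
print(to_utf8_path("/caf%E9", "latin-1"))
```

The point of the sketch is that only the sender needs to know the legacy
encoding; everything downstream can rely on UTF-8.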

I'd rather have clear interop rules and simple-to-understand breakage at
HTTP/2 adoption time than diffuse heisenbugs and sites that work by
sheer chance as long as you chain components with the same unwritten
assumptions. See the Python 2 Unicode debacle and how much fuzzy encoding
rules cost them.

-- 
Nicolas Mailhot

Received on Thursday, 16 January 2014 14:34:06 UTC