Re: UTF-8 in URIs from Julian Reschke on 2014-01-16 (ietf-http-wg@w3.org from January to March 2014)

From: Julian Reschke <julian.reschke@gmx.de>
Date: Thu, 16 Jan 2014 15:41:52 +0100
To: Nicolas Mailhot <nicolas.mailhot@laposte.net>, Zhong Yu <zhong.j.yu@gmail.com>
CC: Gabriel Montenegro <gabriel.montenegro@microsoft.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>, Osama Mazahir <osamam@microsoft.com>, Dave Thaler <dthaler@microsoft.com>, Mike Bishop <michael.bishop@microsoft.com>, Matthew Cox <macox@microsoft.com>
Message-ID: <52D7EFB0.6030808@gmx.de>

On 2014-01-16 15:33, Nicolas Mailhot wrote:
>
> On Thu, 16 January 2014 12:25, Zhong Yu wrote:
>> There is no way to enforce UTF-8 on URIs; we cannot even enforce
>> %-encoding: the server can always build a proprietary encoding on top of
>> ASCII chars (for its own convenience, not for being cryptic to others).
>>
>> URIs have never been supposed to be understandable by anyone other
>> than the original server. I don't see how we can change that, unless
>> we turn URIs into a full-blown language with structures, semantics, and
>> a huge vocabulary.
>
> Look, that is all nonsense.

Um, no.

> URLs are treated as text in HTML documents. URLs are treated as text in
> logs and traffic consoles. URLs are treated as text by web site designers
> (otherwise all accesses would be in the form mywebsite.com/opaquenumber
> and how many sites actually do that?). Web traffic is not direct

Yes. So?

> end-to-end; it goes through intermediaries that need to decode part of the
> HTTP envelope, and besides, web sites are more and more interpenetrated
> (URL soup, aka mashups and clouds), so decoding has not been a private web
> site affair for a long time.

I still don't understand why intermediaries "need" to "decode" request URIs.

> None of those elements manipulate chains of bytes; they manipulate text, and
> the difference between chains of bytes and text is clear encoding rules (I
> know it is a huge leap in understanding for most developers who have not had
> to deal extensively with the fallout of encoding problems).

The URI on the wire is indeed a sequence of ASCII characters (well, a 
legal one). The fact that non-ASCII characters and delimiters can be 
embedded using percent-escaping doesn't change that fact.
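
A minimal sketch of that point, using Python's standard urllib.parse. The
path and the choice of charsets are made up for illustration and do not come
from this thread. It percent-escapes the same text from two different
charsets and checks that both results are pure ASCII:

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical path with non-ASCII characters (illustration only).
path = "/café/日本"

# Percent-escape the UTF-8 bytes of the path; "/" stays as a delimiter.
wire_utf8 = quote(path, safe="/", encoding="utf-8")

# Percent-escape similar text, but starting from ISO-8859-1 bytes.
wire_latin1 = quote("/café", safe="/", encoding="latin-1")

print(wire_utf8)    # /caf%C3%A9/%E6%97%A5%E6%9C%AC
print(wire_latin1)  # /caf%E9

# Both on-the-wire forms consist of ASCII characters only.
print(wire_utf8.isascii(), wire_latin1.isascii())  # True True

# Unescaping gives bytes back; the escaped form does not name a charset.
print(unquote_to_bytes(wire_latin1))  # b'/caf\xe9'
```

Either form is a legal sequence of ASCII characters on the wire. Turning the
unescaped bytes back into text requires knowing which charset produced them,
and the escaped form itself does not carry that information.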

> There is a difference between semantics (which are the business of web
> sites) and technical encoding. I don't care a fig about what encoding a
> web server uses on its filesystem or the encoding of web pages. What I
> want is for the on-wire representation, which needs to be decoded by all
> kinds of third parties for things to work smoothly, to be clearly defined
> without the usual "chain of bytes" cop-out.

Again: please clarify why it needs to be "decoded" by anybody except the 
origin server.

> ...

Best regards, Julian

Received on Thursday, 16 January 2014 14:42:32 UTC