- From: Larry Masinter <masinter@adobe.com>
- Date: Fri, 17 Jan 2014 07:50:40 +0000
- To: Nicolas Mailhot <nicolas.mailhot@laposte.net>, "julian.reschke@gmx.de" <julian.reschke@gmx.de>
- CC: Zhong Yu <zhong.j.yu@gmail.com>, Gabriel Montenegro <gabriel.montenegro@microsoft.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>, Osama Mazahir <osamam@microsoft.com>, Dave Thaler <dthaler@microsoft.com>, Mike Bishop <michael.bishop@microsoft.com>, "Matthew Cox" <macox@microsoft.com>
It's a little hard to wade through the rhetoric ("dead trees"?), but I don't see a problem with saying in HTTP/2.0 that a request can be an IRI, or an IRI-path encoded in UTF-8. A "Host" header can likewise carry UTF-8-encoded Unicode for an IDN.

Gateways from HTTP/1 can leave the URI path decoded; they probably should not change any values. Gateways from HTTP/2 to HTTP/1 should percent-hex-encode any non-ASCII character in both host and path.

This is a new HTTP/2 feature, though, and is it worth it? All things considered, it is more complexity for a small saving in space.

Larry
--
http://larry.masinter.net

-----Original Message-----
From: Nicolas Mailhot [mailto:nicolas.mailhot@laposte.net]
Sent: Thursday, January 16, 2014 7:09 AM
To: julian.reschke@gmx.de
Cc: Nicolas Mailhot; Zhong Yu; Gabriel Montenegro; ietf-http-wg@w3.org; Osama Mazahir; Dave Thaler; Mike Bishop; Matthew Cox
Subject: Re: UTF-8 in URIs

On Thu, 16 January 2014 15:41, Julian Reschke wrote:
> On 2014-01-16 15:33, Nicolas Mailhot wrote:
>>
>> On Thu, 16 January 2014 12:25, Zhong Yu wrote:
>>> There is no way to enforce UTF-8 in URIs; we cannot even enforce
>>> %-encoding, since a server can always build a proprietary encoding on
>>> top of ASCII characters (for its own convenience, not to be cryptic
>>> to others).
>>>
>>> URIs have never been meant to be understood by anyone other than the
>>> originating server. I don't see how we can change that, unless we
>>> turn URIs into a full-blown language with structure, semantics, and a
>>> huge vocabulary.
>>
>> Look, that is all nonsense.
>
> Um, no.
>
>> URLs are treated as text in HTML documents. URLs are treated as text
>> in logs and traffic consoles. URLs are treated as text by web site
>> designers (otherwise all accesses would take the form
>> mywebsite.com/opaquenumber, and how many sites actually do that?). Web
>> traffic is not direct
>
> Yes. So?
>
>> end-to-end: it goes through intermediaries that need to decode part of
>> the HTTP envelope, and besides, web sites are more and more
>> interpenetrated (the URL soup, a.k.a. mashups and clouds), so decoding
>> has not been a private web-site affair for a long time.
>
> I still don't understand why intermediaries "need" to "decode" request
> URIs.

Because you want to write intermediary processing rules in text form, just as server sites write their rules in text form and the browser user writes his request in text form, and nobody wants to write his rules in binary just because the encoding of the processed objects is undefined.

Because traffic consoles that display chains of octet values are useless in practical terms.

Because web objects are identified by URLs, and an identifier that changes depending on arbitrary client/server encoding choices raises the complexity far above simply telling everyone "write your URLs in HTTP/2 in UTF-8".

Because there *are* semantics in web site organisation, but they are only apparent in the text encoding the site creator used.

Because all the systems that tried to juggle multiple implicit encodings instead of imposing a single rule have been pathetic failures (they "work" only as long as the actors do not use the multiple-encoding freedom but add back the encoding convention the designer forgot to provide).

--
Nicolas Mailhot
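[Editor's illustration: a minimal sketch of the HTTP/2-to-HTTP/1 downgrade rule Larry describes above, i.e. percent-hex-encoding every non-ASCII byte of a UTF-8 host and path. The function name is illustrative, not taken from any draft; real gateways would also have to consider IDNA/punycode for host names.]

```python
# Sketch only: percent-hex-encode the non-ASCII bytes of a UTF-8 string,
# as Larry's proposed HTTP/2 -> HTTP/1 gateway rule would require.
# ASCII bytes (including any existing %XX escapes) pass through unchanged.

def pct_encode_non_ascii(text: str) -> str:
    """Percent-encode only the non-ASCII bytes of `text` (UTF-8)."""
    out = []
    for byte in text.encode("utf-8"):
        if byte < 0x80:
            out.append(chr(byte))            # ASCII: leave as-is
        else:
            out.append("%{:02X}".format(byte))  # non-ASCII byte -> %XX
    return "".join(out)

# An IRI path and an IDN host as they might appear raw in HTTP/2:
print(pct_encode_non_ascii("/café/menu"))      # -> /caf%C3%A9/menu
print(pct_encode_non_ascii("bücher.example"))  # -> b%C3%BCcher.example
```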
Received on Friday, 17 January 2014 07:51:27 UTC