Re: UTF-8 in URIs from Nicolas Mailhot on 2014-01-16 (ietf-http-wg@w3.org from January to March 2014)

From: Nicolas Mailhot <nicolas.mailhot@laposte.net>
Date: Thu, 16 Jan 2014 11:24:01 +0100
To: "Julian Reschke" <julian.reschke@gmx.de>
Cc: "Nicolas Mailhot" <nicolas.mailhot@laposte.net>, "Zhong Yu" <zhong.j.yu@gmail.com>, "Gabriel Montenegro" <gabriel.montenegro@microsoft.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>, "Osama Mazahir" <osamam@microsoft.com>, "Dave Thaler" <dthaler@microsoft.com>, "Mike Bishop" <michael.bishop@microsoft.com>, "Matthew Cox" <macox@microsoft.com>
Message-ID: <34d061601af054efaf01429619fd0098.squirrel@arekh.dyndns.org>

On Thu, 16 January 2014 11:06, Julian Reschke wrote:
> On 2014-01-16 10:52, Nicolas Mailhot wrote:
>>
>> On Wed, 15 January 2014 21:46, Zhong Yu wrote:
>>> Can you give an example where an intermediary benefits from decoding
>>> URI octets into unicodes?
>>
>> Intermediaries cannot perform URL-based filtering if they cannot decode
>> URLs reliably. Intermediaries need to normalise URLs to a single encoding
>> if they log them (for debugging or policy purposes). Unix-like "just a
>> bunch of bytes with no encoding indication" is an i18n disaster supported
>> only by users of ASCII scripts.
>
> Well, you could log what you got on the wire. It's ASCII.

And it's useless if you can't interpret it reliably. You may as well log the
output of /dev/random at the time. Nobody has time to have humans comb through
millions of log lines to fix encoding errors.

>> I favour making URLs UTF-8 by default in HTTP/2 (just as it was in XML,
>> that's one part of the XML spec that worked very well) and requiring
>> HTTP/1-to-HTTP/2 bridges to translate to the canonical form. Helping
>> clients push local 8-bit encodings will just perpetuate the pre-2000
>> legacy mess.
>
> How do you translate a URI with unknown URI encoding to UTF-8?

You treat it as UTF-8. If it fails UTF-8 sanity rules you fail with an
error. That will make people fix their encodings quickly.
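
To make that concrete, here is a minimal sketch of what a bridge could do,
not a definitive implementation. The function name decode_path_strict is
made up for illustration; the only real API used is Python's
urllib.parse.unquote_to_bytes plus the strict UTF-8 codec.

```python
from urllib.parse import unquote_to_bytes


def decode_path_strict(raw_path: str) -> str:
    """Hypothetical helper: percent-decode a path, then require valid UTF-8.

    Raises UnicodeDecodeError when the octets are not well-formed UTF-8,
    which is the "fail with an error" case.
    """
    # Turn %XX escapes back into the raw octets sent on the wire.
    octets = unquote_to_bytes(raw_path)
    # Strict decoding: no replacement characters, no guessing of encodings.
    return octets.decode("utf-8", errors="strict")


if __name__ == "__main__":
    for path in ("/caf%C3%A9", "/caf%E9"):
        try:
            print(path, "->", decode_path_strict(path))
        except UnicodeDecodeError as exc:
            print(path, "-> rejected:", exc.reason)
```

The first path is UTF-8 and decodes cleanly; the second is the same word
pushed in a legacy 8-bit encoding and gets rejected instead of guessed at.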

>> Whenever someone specifies a new, better encoding it will be time for
>> HTTP/3. Unicode specs are way more complex than HTTP; changes won't
>> happen quicker than HTTP revisions.
>
> The problem here is that HTTP URIs are octet sequences, not character
> sequences.

The problem is that octet sequences are useless by themselves if you cannot
decode them.
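
A small illustration of the point, as a sketch only: the same octets read
under three legacy 8-bit encodings give three different strings. The
function name interpretations and the choice of encodings are mine, picked
for the example; the codecs themselves ship with Python.

```python
from urllib.parse import unquote_to_bytes


def interpretations(raw_path, encodings=("latin-1", "iso8859-7", "koi8-r")):
    """Hypothetical helper: decode the same octets under several encodings."""
    # The octets on the wire are identical in every case.
    octets = unquote_to_bytes(raw_path)
    # Each 8-bit codec maps them to different characters.
    return {name: octets.decode(name) for name in encodings}


if __name__ == "__main__":
    for name, text in interpretations("/caf%E9").items():
        print(name, "->", text)
```

Without an encoding indication, a log line or a filter rule has no way to
tell which of those readings the client meant.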


-- 
Nicolas Mailhot

Received on Thursday, 16 January 2014 10:24:31 UTC