Re: UTF-8 in URIs from Zhong Yu on 2014-01-16 (ietf-http-wg@w3.org from January to March 2014)

From: Zhong Yu <zhong.j.yu@gmail.com>
Date: Thu, 16 Jan 2014 05:25:04 -0600
To: Nicolas Mailhot <nicolas.mailhot@laposte.net>
Cc: Julian Reschke <julian.reschke@gmx.de>, Gabriel Montenegro <gabriel.montenegro@microsoft.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>, Osama Mazahir <osamam@microsoft.com>, Dave Thaler <dthaler@microsoft.com>, Mike Bishop <michael.bishop@microsoft.com>, Matthew Cox <macox@microsoft.com>
Message-ID: <CACuKZqEuL2Roib7yw9vvo1XXQuxRvXOXe==8vXKh=d=qjE92Eg@mail.gmail.com>

There is no way to enforce UTF-8 on URIs. We cannot even enforce
%-encoding: a server can always build a proprietary encoding on top of
ASCII characters (for its own convenience, not to be cryptic to others).

URIs were never meant to be understood by anyone other than the
original server. I don't see how we can change that, unless we turn
URIs into a full-blown language with structure, semantics, and a huge
vocabulary.
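
To make the enforcement point concrete, here is a minimal Python sketch.
The helper name try_decode_utf8 and the sample paths are hypothetical,
not taken from any spec or from this thread. It percent-decodes a path
to octets and then attempts a strict UTF-8 decode:

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def try_decode_utf8(raw_path: str) -> Optional[str]:
    """Percent-decode raw_path to octets, then strictly decode as UTF-8.

    Returns the decoded text, or None if the octets are not valid UTF-8.
    Hypothetical helper, for illustration only.
    """
    octets = unquote_to_bytes(raw_path)
    try:
        return octets.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return None


# Hypothetical sample paths, all spelling "café" in different ways.
utf8_path = "/caf%C3%A9"        # %-encoded UTF-8 octets
latin1_path = "/caf%E9"         # %-encoded ISO-8859-1 octet
private_path = "/x-636166c3a9"  # made-up server-private hex scheme, plain ASCII

print(try_decode_utf8(utf8_path))     # decodes to the intended text
print(try_decode_utf8(latin1_path))   # None: rejected by the strict decode
print(try_decode_utf8(private_path))  # passes untouched; meaning stays opaque
```

The strict check does reject the Latin-1 path, but the made-up private
scheme is pure ASCII and therefore valid UTF-8, so the check passes
without telling an intermediary anything about what the server meant.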

Zhong Yu



On Thu, Jan 16, 2014 at 4:24 AM, Nicolas Mailhot
<nicolas.mailhot@laposte.net> wrote:
>
> On Thu, 16 January 2014 11:06, Julian Reschke wrote:
>> On 2014-01-16 10:52, Nicolas Mailhot wrote:
>>>
>>> On Wed, 15 January 2014 21:46, Zhong Yu wrote:
>>>> Can you give an example where an intermediary benefits from decoding
>>>> URI octets into Unicode characters?
>>>
>>> Intermediaries cannot perform URL-based filtering if they cannot
>>> decode URLs reliably. Intermediaries need to normalise URLs to a
>>> single encoding if they log them (for debugging or policy purposes).
>>> The Unix-like "just a bunch of bytes with no encoding indication"
>>> approach is an i18n disaster supported only by users of ASCII scripts.
>>
>> Well, you could log what you got on the wire. It's ASCII.
>
> And it's useless if you can't interpret it reliably. You may as well log
> the output of /dev/random at the time. Nobody has time to have humans
> comb through millions of log lines to fix encoding errors.
>
>>> I favour making URLs UTF-8 by default in HTTP/2 (just as it was in XML;
>>> that's one part of the XML spec that worked very well) and requiring
>>> HTTP/1-to-2 bridges to translate to the canonical form. Helping clients
>>> push local 8-bit encodings will just perpetuate the pre-2000 legacy mess.
>>
>> How do you translate a URI with unknown URI encoding to UTF-8?
>
> You treat it as UTF-8. If it fails UTF-8 validity checks, you fail with
> an error. That will make people fix their encodings quickly.
>
>>> Whenever someone specifies a new, better encoding it will be time for
>>> HTTP/3. Unicode specs are way more complex than HTTP; changes won't
>>> happen quicker than HTTP revisions.
>>
>> The problem here is that HTTP URIs are octet sequences, not character
>> sequences.
>
> The problem is that octet sequences are useless by themselves if you
> cannot decode them.
>
>
> --
> Nicolas Mailhot
>

Received on Thursday, 16 January 2014 11:25:35 UTC