Re: UTF-8 in URIs

On Thu 16 January 2014 11:28, Julian Reschke wrote:
> On 2014-01-16 11:24, Nicolas Mailhot wrote:
>> On Thu 16 January 2014 11:06, Julian Reschke wrote:
>>> On 2014-01-16 10:52, Nicolas Mailhot wrote:
>>>>
>>>> On Wed 15 January 2014 21:46, Zhong Yu wrote:
>>>>> Can you give an example where an intermediary benefits from decoding
>>>>> URI octets into unicodes?
>>>>
>>>> Intermediaries cannot perform URL-based filtering if they cannot decode
>>>> URLs reliably. Intermediaries need to normalise URLs to a single encoding
>>>> if they log them (for debugging or policy purposes). Unix-like "just a
>>>> bunch of bytes with no encoding indication" is an i18n disaster supported
>>>> only by users of ASCII scripts.
>>>
>>> Well, you could log what you got on the wire. It's ASCII.
>>
>> And it's useless if you can't interpret it reliably. May as well log the
>> output of /dev/random at the time. Nobody has time to get humans to comb
>> millions of log lines to fix encoding errors.
>
> Define "encoding error" in the context of a URI.

Any URI that an arbitrary HTTP processor (web site, intermediary, web
client) cannot reliably decode, without outside help, into the textual
representation its creator intended.

And there *is* a preferred textual representation because, you know, people
do not enter URLs in binary editors.
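
To make the ambiguity concrete, here is a small sketch in Python. The path
"/caf%E9" is a made-up example, and the two legacy encodings are arbitrary
picks for illustration. The same octets yield different text depending on
which encoding the reader guesses, and nothing on the wire says which guess
is right.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical path as seen on the wire: pure ASCII, percent-encoded.
octets = unquote_to_bytes("/caf%E9")

# Two equally "valid" readings of the same octets (encodings chosen
# arbitrarily for the illustration).
as_latin1 = octets.decode("iso-8859-1")
as_cp1251 = octets.decode("cp1251")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under strict UTF-8 rules."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(repr(as_latin1), repr(as_cp1251), is_valid_utf8(octets))
```

Both decodes succeed and disagree, while the strict UTF-8 check rejects the
octets outright, which is the only one of the three outcomes a log processor
can act on without outside help.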

>>>> I favour making URLs UTF-8 by default in HTTP/2 (just as it was in XML,
>>>> that's one part of the XML spec that worked very well) and require
>>>> HTTP/1 to 2 bridges to translate to the canonical form. Helping clients
>>>> push local 8-bit encodings will just perpetuate pre-2000 legacy mess.
>>>
>>> How do you translate a URI with unknown URI encoding to UTF-8?
>>
>> You treat it as UTF-8. If it fails UTF-8 sanity rules you fail with an
>> error. That will make people fix their encodings quickly.
>
> This is not going to work:
>
> a) People may have chosen a non-UTF8 encoding by accident (system locale
> etc) and can't change it retroactively,

They can add a UTF-8 translator gateway at HTTP/2 adoption time. That is no
different from, and much easier than, fixing the mass of documents that
needed fixing once people started exchanging them in binary rather than
dead-wood form. Some past mistakes need correction; you can't grandfather
them eternally at the cost of eternal future interop problems.
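
A minimal sketch of what such a gateway could do per request path, in
Python. The function name to_utf8_path and the legacy_encoding parameter are
hypothetical, and it assumes the site operator knows which legacy encoding
their old URLs used. Octets that already pass strict UTF-8 are kept; anything
else is transcoded once, at the bridge.

```python
from urllib.parse import quote, unquote_to_bytes


def to_utf8_path(path: str, legacy_encoding: str = "iso-8859-1") -> str:
    """Rewrite a percent-encoded path so its octets are UTF-8.

    legacy_encoding is an assumption: the encoding the site is known
    to have used before the gateway was put in place.
    """
    out = []
    # Work segment by segment so an encoded "/" (%2F) inside a segment
    # is not confused with a path separator.
    for segment in path.split("/"):
        raw = unquote_to_bytes(segment)
        try:
            text = raw.decode("utf-8")  # strict: raises on invalid input
        except UnicodeDecodeError:
            text = raw.decode(legacy_encoding)
        # safe="" re-encodes everything outside the unreserved set, so
        # the output is normalised as well as transcoded.
        out.append(quote(text, safe=""))
    return "/".join(out)


print(to_utf8_path("/caf%E9"))
print(to_utf8_path("/caf%C3%A9"))
```

This only covers the path; query strings and strings that happen to be valid
in both encodings would need their own policy.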

> b) There might be actual *binary* data in the URI.

So just define the canonical binary-to-UTF-8 mapping. If you don't, your URL
will crash as soon as it needs to be displayed in an address bar, network
console or activity log. Again, it is much easier to define a single
binary-to-UTF-8 mapping than rules from random encodings to random display
encodings (hint: the latter is not possible, and that's the core problem).
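
One possible shape for such a mapping, purely as an illustration: unpadded
base64url, which turns arbitrary octets into ASCII (and therefore valid
UTF-8) text that can be displayed and logged. Choosing base64url is my
assumption here, not something any spec mandates, and the helper names are
made up.

```python
import base64


def binary_to_segment(data: bytes) -> str:
    """Map arbitrary octets to a displayable, URL-safe text segment."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def segment_to_binary(segment: str) -> bytes:
    """Inverse mapping: restore the padding, then decode."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


blob = b"\xff\xfe\x00\x80"  # not valid UTF-8, not valid in most encodings
segment = binary_to_segment(blob)
print(segment, segment_to_binary(segment) == blob)
```

The point is not this particular alphabet but that the mapping is single and
reversible, so every processor displays the same text for the same octets.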

>>>> Whenever someone specifies a new better encoding it will be time for
>>>> HTTP/3. Unicode specs are way more complex than http, changes won't
>>>> happen quicker than http revisions.
>>>
>>> The problem here is that HTTP URIs are octet sequences, not character
>>> sequences.
>>
>> The problem is that octet sequences are useless by themselves if you can
>> not decode them.
>
> Hm, no. They just happen to work in a way different from your
> preference, but they do just work fine.

No, they don't work. Working is not "avoid any automated processing and
for Christ's sake never use anything but ASCII because the state is
undefined and things will randomly break".

There is a reason URL bars do not display binary sequences. You need
reliable text decoding for humans to get a grip on the result. The %foo
Punycode approach is predicated on browsers being able to display a text
form; otherwise the result is just binary soup, and humans cannot
distinguish correct binary soup from hijacked binary soup.

All the talk about binary HTTP/2 being OK because someone will just write a
Wireshark plugin is just wind if the Wireshark plugin does not have
reliable rules to translate the on-wire representation to text humans
understand.

Best regards,

-- 
Nicolas Mailhot

Received on Thursday, 16 January 2014 14:57:47 UTC