Re: UTF-8 in URIs

On 2014-01-16 11:24, Nicolas Mailhot wrote:
> On Thu, 16 January 2014 11:06, Julian Reschke wrote:
>> On 2014-01-16 10:52, Nicolas Mailhot wrote:
>>>
>>> On Wed, 15 January 2014 21:46, Zhong Yu wrote:
>>>> Can you give an example where an intermediary benefits from decoding
>>>> URI octets into Unicode characters?
>>>
>>> Intermediaries cannot perform URL-based filtering if they cannot decode
>>> URLs reliably. Intermediaries need to normalise URLs to a single
>>> encoding if they log them (for debugging or policy purposes). Unix-like
>>> "just a bunch of bytes with no encoding indication" is an i18n disaster
>>> supported only by users of ASCII scripts.
>>
>> Well, you could log what you got on the wire. It's ASCII.
>
> And it's useless if you can't interpret it reliably. May as well log the
> output of /dev/random at the time. We don't have time to have humans comb
> through millions of log lines to fix encoding errors.

Define "encoding error" in the context of a URI.

>>> I favour making URLs UTF-8 by default in HTTP/2 (just as it was in XML;
>>> that's one part of the XML spec that worked very well) and requiring
>>> HTTP/1 to HTTP/2 bridges to translate to the canonical form. Helping
>>> clients push local 8-bit encodings will just perpetuate the pre-2000
>>> legacy mess.
>>
>> How do you translate a URI with unknown URI encoding to UTF-8?
>
> You treat it as UTF-8. If it fails UTF-8 sanity rules you fail with an
> error. That will make people fix their encodings quickly.

This is not going to work:

a) People may have chosen a non-UTF-8 encoding by accident (system locale,
etc.) and can't change it retroactively,

b) There might be actual *binary* data in the URI.
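
To make (a) and (b) concrete, here is a minimal Python sketch of the
"treat it as UTF-8 and fail otherwise" rule. The function names and the
three sample paths are made up for illustration; they do not come from any
spec or deployed intermediary.

```python
from urllib.parse import unquote_to_bytes


def decode_path_as_utf8(raw_path: str) -> str:
    """Percent-decode the path to octets, then decode strictly as UTF-8."""
    octets = unquote_to_bytes(raw_path)
    # Strict decoding: raises UnicodeDecodeError on any invalid sequence.
    return octets.decode("utf-8")


def try_decode(raw_path: str):
    """Return the decoded path, or None if the octets are not valid UTF-8."""
    try:
        return decode_path_as_utf8(raw_path)
    except UnicodeDecodeError:
        return None


# Hypothetical sample paths (assumptions for illustration only).
utf8_path = "/caf%C3%A9"            # "é" percent-encoded from UTF-8
latin1_path = "/caf%E9"             # "é" percent-encoded from ISO-8859-1, case (a)
binary_path = "/blob/%FF%FE%00%80"  # arbitrary binary octets, case (b)

for path in (utf8_path, latin1_path, binary_path):
    print(path, "->", try_decode(path))
```

Only the first path decodes. The other two are perfectly usable URIs on
the wire today, and a strict rule would reject both.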

>>> Whenever someone specifies a new, better encoding, it will be time for
>>> HTTP/3. Unicode specs are way more complex than HTTP; changes won't
>>> happen quicker than HTTP revisions.
>>
>> The problem here is that HTTP URIs are octet sequences, not character
>> sequences.
>
> The problem is that octet sequences are useless by themselves if you can
> not decode them.

Hm, no. They just happen to work in a way different from your
preference, but they do work just fine.
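
As a rough illustration of working on octets only, the sketch below
compares and logs URIs in their wire form without ever guessing a
character encoding. The one normalization it applies, uppercasing the hex
digits of percent-encoded triplets, is the case normalization described in
RFC 3986; the helper names are hypothetical, and a real intermediary would
likely want more of the RFC 3986 normalization steps than this.

```python
import re

# Matches one percent-encoded octet such as "%e9" or "%C3".
_PCT_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")


def normalize_pct_case(wire_uri: str) -> str:
    """Uppercase the hex digits of every percent-encoded octet."""
    return _PCT_TRIPLET.sub(lambda m: m.group(0).upper(), wire_uri)


def same_octets(a: str, b: str) -> bool:
    """Compare two wire-form URIs without decoding them to characters."""
    return normalize_pct_case(a) == normalize_pct_case(b)


# The log line stays pure ASCII whatever the origin server's encoding was.
print(normalize_pct_case("/caf%e9"))
print(same_octets("/caf%e9", "/caf%E9"))
print(same_octets("/caf%E9", "/caf%C3%A9"))
```

The last comparison is false on purpose: the two paths may look like the
same word to a human, but they are different octet sequences and may name
different resources.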

Best regards, Julian

(Yes, I'd prefer that there were more UTF-8 in HTTP, but there are
problems that are hard to solve without breaking things.)

Received on Thursday, 16 January 2014 10:29:23 UTC