Re: UTF-8 in URIs

On Thu, 16 January 2014 11:06, Julian Reschke wrote:
> On 2014-01-16 10:52, Nicolas Mailhot wrote:
>>
>> On Wed, 15 January 2014 21:46, Zhong Yu wrote:
>>> Can you give an example where an intermediary benefits from decoding
>>> URI octets into Unicode characters?
>>
>> Intermediaries cannot perform URL-based filtering if they cannot
>> decode URLs reliably. Intermediaries need to normalise URLs to a
>> single encoding if they log them (for debugging or policy purposes).
>> The unix-like "just a bunch of bytes with no encoding indication"
>> model is an i18n disaster that works only for users of ASCII scripts.
>
> Well, you could log what you got on the wire. It's ASCII.

And it's useless if you can't interpret it reliably. You may as well log
the output of /dev/random instead. Nobody has time to have humans comb
through millions of log lines to fix encoding errors.
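
To be concrete, this is the kind of normalisation I have in mind (a
minimal Python sketch; the policy choices are mine: decoding everything
is fine for a log line, and a real intermediary would keep reserved
characters such as %2F escaped):

  from urllib.parse import unquote_to_bytes

  def normalise_for_log(raw_target):
      # Percent-decode the ASCII wire form back to raw octets, then
      # interpret those octets as UTF-8. Strict decoding raises
      # UnicodeDecodeError on anything that is not valid UTF-8, so
      # mis-encoded URLs are caught at the edge instead of silently
      # polluting the logs.
      octets = unquote_to_bytes(raw_target)
      return octets.decode("utf-8")

  normalise_for_log("/caf%C3%A9")  # -> '/café'
  normalise_for_log("/caf%E9")     # UnicodeDecodeError (latin-1 é)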

>> I favour making URLs UTF-8 by default in HTTP/2 (just as they were in
>> XML; that's one part of the XML spec that worked very well) and
>> requiring HTTP/1-to-2 bridges to translate to the canonical form.
>> Helping clients push local 8-bit encodings will just perpetuate the
>> pre-2000 legacy mess.
>
> How do you translate a URI with unknown URI encoding to UTF-8?

You treat it as UTF-8. If the octets fail UTF-8 validity checks, you
reject the request with an error. That will make people fix their
encodings quickly.
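
A 1-to-2 bridge could apply exactly that check before forwarding. A
sketch again, and the choice to answer failures with a 400 is my
assumption, nothing a spec mandates today:

  from urllib.parse import unquote_to_bytes

  def is_valid_utf8_target(raw_target):
      # True if the percent-decoded octets form valid UTF-8; a
      # bridge would reply 400 (Bad Request) whenever this is
      # False instead of forwarding the request.
      try:
          unquote_to_bytes(raw_target).decode("utf-8")
          return True
      except UnicodeDecodeError:
          return False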

>> Whenever someone specifies a new, better encoding, it will be time
>> for HTTP/3. Unicode specs are far more complex than HTTP; changes
>> won't happen more quickly than HTTP revisions.
>
> The problem here is that HTTP URIs are octet sequences, not character
> sequences.

The problem is that octet sequences are useless by themselves if you
cannot decode them.
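
To make that concrete: the same octets decode to different, equally
"valid" strings depending on which charset you guess, so the bytes
alone cannot tell you what the client meant:

  octets = bytes([0xC3, 0xA9])
  octets.decode("utf-8")    # 'é'  -- one character
  octets.decode("latin-1")  # 'Ã©' -- two characters

Both calls succeed, and nothing in the octets says which reading was
intended.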


-- 
Nicolas Mailhot

Received on Thursday, 16 January 2014 10:24:31 UTC