
Re: UTF-8 in URIs

From: Julian Reschke <julian.reschke@gmx.de>
Date: Thu, 16 Jan 2014 11:06:45 +0100
Message-ID: <52D7AF35.4010401@gmx.de>
To: Nicolas Mailhot <nicolas.mailhot@laposte.net>, Zhong Yu <zhong.j.yu@gmail.com>
CC: Gabriel Montenegro <gabriel.montenegro@microsoft.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>, Osama Mazahir <osamam@microsoft.com>, Dave Thaler <dthaler@microsoft.com>, Mike Bishop <michael.bishop@microsoft.com>, Matthew Cox <macox@microsoft.com>
On 2014-01-16 10:52, Nicolas Mailhot wrote:
> On Wed, 15 January 2014 21:46, Zhong Yu wrote:
>> Can you give an example where an intermediary benefits from decoding
>> URI octets into Unicode characters?
> Intermediaries cannot perform URL-based filtering if they cannot decode
> URLs reliably. Intermediaries need to normalise URLs to a single encoding
> if they log them (for debugging or policy purposes). The Unix-like "just a
> bunch of bytes with no encoding indication" approach is an i18n disaster,
> workable only for users of ASCII scripts.

Well, you could log what you got on the wire. It's ASCII.
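A one-line check in Python makes the point (the example request-target is hypothetical): the percent-encoded wire form contains only ASCII octets, so it can be logged verbatim without decoding anything.

```python
# The request-target as it appears on the wire is percent-encoded,
# so every character is printable ASCII and it is loggable as-is.
wire_target = "/caf%E9?name=%C3%A9"   # hypothetical example
assert all(0x20 < ord(c) < 0x7F for c in wire_target)
print("loggable as-is:", wire_target)
```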

> I favour making URLs UTF-8 by default in HTTP/2 (just as they were in
> XML; that's one part of the XML spec that worked very well) and requiring
> HTTP/1-to-2 bridges to translate to the canonical form. Helping clients
> push local 8-bit encodings will just perpetuate the pre-2000 legacy mess.

How do you translate a URI with unknown URI encoding to UTF-8?
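A minimal Python sketch of why this is hard (the path is hypothetical): the percent-decoded octets carry no encoding label, so the same bytes are a valid Latin-1 string but not valid UTF-8, and an intermediary can only guess which was meant.

```python
from urllib.parse import unquote_to_bytes

# Percent-decode a path segment; the URI carries no encoding label,
# so the resulting octets are just bytes.
raw = unquote_to_bytes("/caf%E9")     # b'/caf\xe9' -- hypothetical path

# Read as Latin-1 these octets are "/café"; read as UTF-8 they are invalid.
print(raw.decode("latin-1"))
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8 -- the original encoding is unknowable")
```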

> Whenever someone specifies a new, better encoding, it will be time for
> HTTP/3. The Unicode specs are far more complex than HTTP; changes to them
> won't happen more quickly than HTTP revisions do.

The problem here is that HTTP URIs are octet sequences, not character 
sequences. There is no simple way to get from the former to the latter 
without breaking a significant number of sites.
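A hypothetical normalising bridge, sketched in Python, shows the breakage: any heuristic that maps octets to characters silently changes the identity of some existing URIs (the fallback-to-Latin-1 rule below is an assumption for illustration, not anything specified).

```python
def guess_decode(octets: bytes) -> str:
    # A common (hypothetical) heuristic: try UTF-8 first, then
    # fall back to Latin-1 when the octets are not valid UTF-8.
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return octets.decode("latin-1")

# A site whose path really is the Latin-1 string "Ã©" (octets C3 A9)
# gets silently rewritten: the heuristic reads those octets as UTF-8 "é",
# naming a different resource than the origin server intended.
print(guess_decode(b"\xc3\xa9"))
```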

Best regards, Julian
Received on Thursday, 16 January 2014 10:07:16 UTC
