Re: UTF-8 in URIs

From: Zhong Yu <zhong.j.yu@gmail.com>
Date: Thu, 16 Jan 2014 05:18:45 -0600
Message-ID: <CACuKZqE2bo4WaseWPa26TW8v1Yqa6i_QD3iCOR2GsmKkdNpKwQ@mail.gmail.com>
To: Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: Gabriel Montenegro <Gabriel.Montenegro@microsoft.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>, Osama Mazahir <OSAMAM@microsoft.com>, Dave Thaler <dthaler@microsoft.com>, Mike Bishop <Michael.Bishop@microsoft.com>, Matthew Cox <macox@microsoft.com>
On Thu, Jan 16, 2014 at 5:00 AM, Bjoern Hoehrmann <derhoermi@gmx.net> wrote:
> * Gabriel Montenegro wrote:
>>Some of us (cc line) have been discussing the unfortunate lack of
>>determinism with respect to URI encoding in HTTP/1.1 and would like
>>HTTP/2.0 to improve upon the situation.
>
> The practice of encoding character data in `http:` addresses using
> anything other than UTF-8 is dying out fast and it is rather unclear

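UTF-8 is not very compact for CJK text: most CJK characters take three
bytes in UTF-8, versus two in legacy encodings such as GBK or Shift_JIS.
That may not be a big deal in general, but URLs are often displayed
verbatim in user interfaces, so the length matters.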
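
To put a rough number on it, here is a minimal sketch in Python. The
helper name `escaped_length` and the sample string are my own choices for
illustration, and the result only holds for this sample, not for every
CJK string:

```python
from urllib.parse import quote

def escaped_length(text, encoding):
    """Length of `text` after encoding it and percent-escaping the bytes."""
    return len(quote(text.encode(encoding)))

sample = "中文"  # two CJK characters, chosen only as an illustration

utf8_len = escaped_length(sample, "utf-8")  # 3 bytes per character here
gbk_len = escaped_length(sample, "gbk")     # 2 bytes per character here

print(utf8_len, gbk_len)
```

Every escaped byte costs three characters in the displayed URL, so the
per-character difference is tripled once the address is shown verbatim.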

> what practical benefit there is in discriminating between addresses
> that use only character data and all character data is UTF-8-encoded
> and addresses that include non-character data or use some legacy
> encoding.
>
> Note that it is perfectly normal to run a service like
>
>   http://example.org/transcode?from=iso-8859-1&to=utf-8&bytes=%C3%B6
>
> Also note that a client cannot possibly know `%C3%B6` can be
> interpreted as UTF-8 bytes without the server telling it as much. This
> does not change when it's instead
>
>   http://example.org/transcode/from/iso-8859-1/to/utf-8/bytes/%C3%B6
>
> Further note that some clients, for display purposes, treat at least
> one of the two examples as though the `%C3%B6` were UTF-8.
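
The ambiguity described above can be sketched in a few lines of Python.
This only illustrates the decoding step and assumes nothing about how any
particular client or server behaves:

```python
from urllib.parse import unquote_to_bytes

# Percent-decoding yields bytes, not characters.
raw = unquote_to_bytes("%C3%B6")

# The same two bytes are valid under both interpretations.
as_utf8 = raw.decode("utf-8")         # one character: o with diaeresis
as_latin1 = raw.decode("iso-8859-1")  # two characters: A-tilde, pilcrow

print(repr(as_utf8), repr(as_latin1))
```

Both decodes succeed, so nothing in the bytes themselves tells a client
which reading the server intended.
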
>
>>In either case, the value to denote the charset would be a 32-bit
>>integer equivalent to the "MIBenum" value in the IANA registry
>>(http://www.iana.org/assignments/character-sets/character-sets.xhtml).
>>Hence, the value would be 106 for UTF-8. The legacy behavior of
>>non-determinism is indicated via the value 0. Notice that this is a
>>reserved value for MIBenum.
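
As I read the proposal, a receiver could act on such a value roughly as
follows. This is only a sketch: the function name `decode_path`, the
table `MIBENUM_TO_CODEC`, and the choice to return raw bytes for the
value 0 are my assumptions, not part of the proposal. The table is a
small subset of the IANA registry mapped to Python codec names.

```python
from urllib.parse import unquote_to_bytes

# Assumed subset of IANA MIBenum values mapped to Python codec names.
MIBENUM_TO_CODEC = {
    3: "ascii",        # US-ASCII
    4: "iso-8859-1",   # ISO_8859-1:1987
    106: "utf-8",      # UTF-8
}

def decode_path(escaped, mibenum):
    """Percent-decode `escaped`, then decode it per the declared MIBenum.

    A MIBenum of 0 means "encoding not declared" (legacy behavior), so the
    bytes are returned as-is instead of being guessed at.
    """
    raw = unquote_to_bytes(escaped)
    if mibenum == 0:
        return raw
    codec = MIBENUM_TO_CODEC.get(mibenum)
    if codec is None:
        raise ValueError(f"unsupported MIBenum: {mibenum}")
    return raw.decode(codec)

print(repr(decode_path("%C3%B6", 106)))
print(repr(decode_path("%C3%B6", 0)))
```

With the value 106 the receiver gets a single character back; with 0 it
gets only the two opaque bytes.
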
>
> Allowing arbitrary encodings needs an exceedingly good reason.
> --
> Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
> Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
> 25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
>
Received on Thursday, 16 January 2014 11:19:12 UTC
