Re: UTF-8 in URIs from Zhong Yu on 2014-01-17 (ietf-http-wg@w3.org from January to March 2014)

From: Zhong Yu <zhong.j.yu@gmail.com>
Date: Fri, 17 Jan 2014 06:53:19 -0600
To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Cc: Bjoern Hoehrmann <derhoermi@gmx.net>, Gabriel Montenegro <Gabriel.Montenegro@microsoft.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>, Osama Mazahir <OSAMAM@microsoft.com>, Dave Thaler <dthaler@microsoft.com>, Mike Bishop <Michael.Bishop@microsoft.com>, Matthew Cox <macox@microsoft.com>
Message-ID: <CACuKZqHURBgeCXPx1+o=c9bw-L1xA2Tum1M+TsU6X6OKVXz7MA@mail.gmail.com>

On Fri, Jan 17, 2014 at 2:19 AM, "Martin J. Dürst"
<duerst@it.aoyama.ac.jp> wrote:
> On 2014/01/16 20:18, Zhong Yu wrote:
>
>> UTF-8 is not very good for CJK charsets. It may not be a big deal in
>> general; however, URLs are often displayed verbatim in user
>> interfaces, so the length matters.
>
>
> The idea that UTF-8 is not very efficient for CJK text is for the most part
> just a myth. It is true that UTF-8 requires 3 bytes per character for CJK
> characters. However, because CJK characters are ideographic or syllabic, the
> number of characters is much lower than for alphabetic scripts.
>
> Languages with alphabetic scripts that require two bytes for each character
> in UTF-8 (e.g. Greek, Russian, Arabic, Hebrew,...) easily require more
> memory for the same text than CJK languages. Languages with alphabetic
> scripts that require three bytes per character in UTF-8 (e.g. Hindi and all
> the other Indian languages, Thai, Lao, Khmer,...) most surely require more
> memory for the same text than CJK languages.

So there are languages that suffer even more from UTF-8. That doesn't
make me feel any better about UTF-8 :)
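
For anyone who wants to check the per-character byte counts quoted
above, here is a rough sketch. The sample words are arbitrary
greetings picked for illustration, not equivalent texts, so this only
shows bytes per code point and says nothing about how much memory the
"same text" takes in each language.

```python
# Sample words are illustrative assumptions, not from the thread.
samples = {
    "English": "hello",
    "Greek": "γεια",
    "Russian": "привет",
    "Hindi": "नमस्ते",
    "Thai": "สวัสดี",
    "Chinese": "你好",
}


def utf8_stats(text):
    """Return (number of code points, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


for language, word in samples.items():
    chars, size = utf8_stats(word)
    print(f"{language:8} {chars} code points, {size} bytes in UTF-8")
```

Comparing real translated sentences instead of single words would be
the way to test the claim about total text size.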

>
> When UTF-8 in URIs is transported or displayed with %-encoding, an
> additional factor of 3 is needed, but that's the same for all the languages
> mentioned above. This is indeed quite inconvenient when displayed, but the
> main inconvenience is not the length, but the fact that it cannot be read.
> Displaying the actual characters, and using UTF-8 directly in transport
> (i.e. essentially using IRIs) solves this problem.

A UTF-16 option would be nice. Let's be honest, UTF-8 is
English-centric. It may be necessary to interoperate with previous
ASCII-based systems. But going forward, UTF-8 should not be favored
just because it is the best option for the English language.
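
To put numbers on the trade-off, a minimal sketch that measures one
path segment as raw UTF-8, as raw UTF-16 (big-endian, no BOM), and as
%-encoded UTF-8. The helper name `sizes` and the sample segments are
made up for illustration; `quote` is the standard
`urllib.parse.quote`, called with `safe=""` so that every non-unreserved
byte is escaped.

```python
from urllib.parse import quote


def sizes(segment):
    """Return the length of one path segment in several encodings."""
    return {
        "utf-8": len(segment.encode("utf-8")),
        # "utf-16-be" avoids counting a byte order mark.
        "utf-16": len(segment.encode("utf-16-be")),
        # Each escaped byte becomes three ASCII characters (%XX).
        "percent-encoded utf-8": len(quote(segment, safe="")),
    }


for segment in ("hello", "你好"):
    print(segment, sizes(segment))
```

For the two-character CJK sample this gives 6 bytes in UTF-8, 4 in
UTF-16, and 18 characters when %-encoded; for the ASCII sample UTF-16
doubles the size while the other two forms stay at 5.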

Zhong Yu

Received on Friday, 17 January 2014 12:53:46 UTC