- From: Zhong Yu <zhong.j.yu@gmail.com>
- Date: Fri, 17 Jan 2014 06:53:19 -0600
- To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Cc: Bjoern Hoehrmann <derhoermi@gmx.net>, Gabriel Montenegro <Gabriel.Montenegro@microsoft.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>, Osama Mazahir <OSAMAM@microsoft.com>, Dave Thaler <dthaler@microsoft.com>, Mike Bishop <Michael.Bishop@microsoft.com>, Matthew Cox <macox@microsoft.com>
On Fri, Jan 17, 2014 at 2:19 AM, "Martin J. Dürst" <duerst@it.aoyama.ac.jp> wrote:
> On 2014/01/16 20:18, Zhong Yu wrote:
>
>> UTF-8 is not very good for CJK charsets. It may not be a big deal in
>> general, however, URLs are often displayed verbatim on user
>> interfaces, the length matters.
>
>
> The idea that UTF-8 is not very efficient for CJK text is to the most part
> just a myth. It is true that UTF-8 requires 3 bytes per character for CJK
> characters. However, because CJK characters are ideographic or syllabic, the
> number of characters is much lower than for alphabetic scripts.
>
> Languages with alphabetic scripts that require two bytes for each character
> in UTF-8 (e.g. Greek, Russian, Arabic, Hebrew,...) easily require more
> memory for the same text than CJK languages. Languages with alphabetic
> scripts that require three bytes per character in UTF-8 (e.g. Hindi and all
> the other Indian languages, Thai, Lao, Khmer,...) most surely require more
> memory for the same text than CJK languages.

So there are languages that suffer even more from UTF-8. That doesn't
make me feel any better about UTF-8 :)

>
> When UTF-8 in URIs is transported or displayed with %-encoding, an
> additional factor of 3 is needed, but that's the same for all the languages
> mentioned above. This is indeed quite inconvenient when displayed, but the
> main inconvenience is not the length, but the fact that it cannot be read.
> Displaying the actual characters, and using UTF-8 directly in transport
> (i.e. essentially using IRIs) solves this problem.

A UTF-16 option would be nice. Let's be honest, UTF-8 is
English-centric. It may be necessary for interoperating with earlier
ASCII-based systems. But going forward, UTF-8 should not be favored
just because it is the best option for the English language.

Zhong Yu
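
The byte counts argued over in this thread can be checked with a short Python sketch. The function name `encoded_sizes` and the sample strings are illustrative assumptions, not something taken from the messages above. UTF-16 is measured without a byte order mark, and the percent-encoded length assumes every UTF-8 byte is escaped as `%XX`.

```python
from urllib.parse import quote


def encoded_sizes(text):
    """Return the size of `text` under the encodings discussed in the thread.

    `encoded_sizes` is a hypothetical helper written for this illustration.
    """
    return {
        "chars": len(text),                       # Unicode code points
        "utf8": len(text.encode("utf-8")),        # bytes in UTF-8
        "utf16": len(text.encode("utf-16-be")),   # bytes in UTF-16, no BOM
        # %-encoding of the UTF-8 bytes; safe="" escapes every non-unreserved byte
        "percent": len(quote(text, safe="")),
    }


if __name__ == "__main__":
    # Sample strings are assumptions chosen to cover 1-, 2-, and 3-byte scripts.
    for sample in ["hello", "Ελλάδα", "中文", "ไทย"]:
        print(sample, encoded_sizes(sample))
```

For the CJK and Thai samples, UTF-16 takes two bytes per character against three in UTF-8, while for the Greek sample both take two; the percent-encoded form is three times the UTF-8 byte count for all of the non-ASCII samples. This only compares bytes per character. It says nothing about how many characters each language needs to express the same text, which is the other half of the argument quoted above.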
Received on Friday, 17 January 2014 12:53:46 UTC