- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Fri, 17 Jan 2014 17:19:52 +0900
- To: Zhong Yu <zhong.j.yu@gmail.com>
- CC: Bjoern Hoehrmann <derhoermi@gmx.net>, Gabriel Montenegro <Gabriel.Montenegro@microsoft.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>, Osama Mazahir <OSAMAM@microsoft.com>, Dave Thaler <dthaler@microsoft.com>, Mike Bishop <Michael.Bishop@microsoft.com>, Matthew Cox <macox@microsoft.com>
On 2014/01/16 20:18, Zhong Yu wrote:

> UTF-8 is not very good for CJK charsets. It may not be a big deal in
> general, however, URLs are often displayed verbatim on user
> interfaces, the length matters.

The idea that UTF-8 is not very efficient for CJK text is for the most part just a myth.

It is true that UTF-8 requires 3 bytes per character for CJK characters. However, because CJK characters are ideographic or syllabic, the number of characters needed for the same text is much lower than for alphabetic scripts.

Languages with alphabetic scripts that require two bytes for each character in UTF-8 (e.g. Greek, Russian, Arabic, Hebrew,...) easily require more memory for the same text than CJK languages. Languages with alphabetic scripts that require three bytes per character in UTF-8 (e.g. Hindi and the other languages written in Indic scripts, Thai, Lao, Khmer,...) most certainly require more memory for the same text than CJK languages.

When UTF-8 in URIs is transported or displayed with %-encoding, each byte becomes three characters (%XX), so an additional factor of 3 applies, but that's the same for all the languages mentioned above. This is indeed quite inconvenient when displayed, but the main inconvenience is not the length, but the fact that it cannot be read. Displaying the actual characters, and using UTF-8 directly in transport (i.e. essentially using IRIs), solves this problem.

Regards, Martin.
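
The byte counts discussed above can be checked with a small Python sketch. This is only an illustration, not part of the original argument: the sample words (the name "Tokyo" in three scripts) and the helper names `utf8_len` and `percent_encoded_len` are assumptions chosen for the example, and a single word is of course not a measurement of whole texts.

```python
from urllib.parse import quote

def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

def percent_encoded_len(text: str) -> int:
    """Number of characters after %-encoding every UTF-8 byte."""
    # safe="" makes quote() escape everything outside the unreserved set.
    return len(quote(text, safe=""))

# Hypothetical sample: the word "Tokyo" in three scripts.
samples = {
    "Japanese": "東京",    # 2 characters, 3 bytes each
    "Russian": "Токио",    # 5 characters, 2 bytes each
    "Hindi": "टोक्यो",      # 6 code points, 3 bytes each
}

for language, word in samples.items():
    print(language, len(word), utf8_len(word), percent_encoded_len(word))
```

For these samples the Japanese form is the shortest in UTF-8 bytes even though each of its characters takes 3 bytes, and the %-encoded length is three times the byte count in every case.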
Received on Friday, 17 January 2014 08:20:52 UTC