Re: UTF-8 in URIs from Martin J. Dürst on 2014-01-17 (ietf-http-wg@w3.org from January to March 2014)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Fri, 17 Jan 2014 17:19:52 +0900
To: Zhong Yu <zhong.j.yu@gmail.com>
CC: Bjoern Hoehrmann <derhoermi@gmx.net>, Gabriel Montenegro <Gabriel.Montenegro@microsoft.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>, Osama Mazahir <OSAMAM@microsoft.com>, Dave Thaler <dthaler@microsoft.com>, Mike Bishop <Michael.Bishop@microsoft.com>, Matthew Cox <macox@microsoft.com>
Message-ID: <52D8E7A8.5050709@it.aoyama.ac.jp>

On 2014/01/16 20:18, Zhong Yu wrote:

> UTF-8 is not very good for CJK charsets. It may not be a big deal in
> general, however, URLs are often displayed verbatim on user
> interfaces, the length matters.

The idea that UTF-8 is not very efficient for CJK text is for the most
part just a myth. It is true that UTF-8 requires 3 bytes per character
for CJK characters. However, because CJK characters are ideographic or
syllabic, the number of characters needed to write the same text is much
lower than for alphabetic scripts.

Languages with alphabetic scripts that require two bytes for each
character in UTF-8 (e.g. Greek, Russian, Arabic, Hebrew, ...) easily
require more memory for the same text than CJK languages. Languages with
alphabetic scripts that require three bytes per character in UTF-8 (e.g.
Hindi and all the other Indian languages, Thai, Lao, Khmer, ...) almost
certainly require more memory for the same text than CJK languages.
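
To make the comparison concrete, here is a minimal Python sketch that
counts UTF-8 bytes for one word written in several scripts. The helper
name utf8_len and the sample word (the city name "Tokyo") are
illustrative assumptions, not taken from the discussion, and a single
word is of course not a representative corpus.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Hypothetical samples: the city name "Tokyo" in four scripts.
samples = {
    "English": "Tokyo",
    "Russian": "Токио",
    "Hindi": "टोक्यो",
    "Japanese": "東京",
}

for language, word in samples.items():
    # len(word) counts code points; utf8_len(word) counts encoded bytes.
    print(f"{language:9} {len(word):2} code points, {utf8_len(word):2} bytes")
```

For this one word, the two-kanji Japanese form takes 6 bytes, while the
five-letter Russian form takes 10.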

When UTF-8 in URIs is transported or displayed with %-encoding, the
length grows by an additional factor of 3, because each byte becomes a
three-character %XX sequence, but that's the same for all the languages
mentioned above. This is indeed quite inconvenient when displayed, but
the main inconvenience is not the length, but the fact that the text
cannot be read. Displaying the actual characters, and using UTF-8
directly in transport (i.e. essentially using IRIs) solves this
problem.
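
The factor of 3 can be checked with a short Python sketch using the
standard library's urllib.parse.quote, which encodes as UTF-8 by
default. The variable names and the sample segment are assumptions
chosen for illustration.

```python
from urllib.parse import quote, unquote

iri_segment = "東京"                 # hypothetical path segment
raw = iri_segment.encode("utf-8")    # the bytes sent when UTF-8 is used directly
escaped = quote(iri_segment)         # the %-encoded form of the same bytes

print(len(raw))      # 6
print(escaped)       # %E6%9D%B1%E4%BA%AC
print(len(escaped))  # 18

# Decoding the %-encoded form recovers the readable characters.
print(unquote(escaped) == iri_segment)  # True
```

Every non-ASCII byte turns into three ASCII characters, so the escaped
form is three times as long and no longer readable as text.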

Regards,   Martin.

Received on Friday, 17 January 2014 08:20:52 UTC