Re: UTF-8 in URIs

On Fri, Jan 17, 2014 at 2:19 AM, "Martin J. Dürst"
<duerst@it.aoyama.ac.jp> wrote:
> On 2014/01/16 20:18, Zhong Yu wrote:
>
>> UTF-8 is not very good for CJK text. It may not be a big deal in
>> general, but URLs are often displayed verbatim in user
>> interfaces, so the length matters.
>
>
> The idea that UTF-8 is not very efficient for CJK text is for the most part
> just a myth. It is true that UTF-8 requires 3 bytes per character for CJK
> characters. However, because CJK characters are ideographic or syllabic, the
> number of characters is much lower than for alphabetic scripts.
>
> Languages with alphabetic scripts that require two bytes for each character
> in UTF-8 (e.g. Greek, Russian, Arabic, Hebrew,...) easily require more
> memory for the same text than CJK languages. Languages with alphabetic
> scripts that require three bytes per character in UTF-8 (e.g. Hindi and all
> the other Indian languages, Thai, Lao, Khmer,...) most surely require more
> memory for the same text than CJK languages.

So there are languages that suffer even more from UTF-8. That doesn't
make me feel any better about UTF-8 :)
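
To put rough numbers on the per-character cost, here is a quick Python
sketch. The sample characters are my own picks, one code point per
script, so it only shows bytes per character; it says nothing about how
many characters a given text needs in each language.

```python
# Bytes per character in UTF-8 vs UTF-16 for a few sample characters.
# The samples are illustrative picks, one BMP code point per script.
samples = {
    "Latin": "a",
    "Greek": "α",
    "Devanagari": "क",
    "CJK": "中",
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


for script, ch in samples.items():
    u8, u16 = encoded_sizes(ch)
    print(f"{script}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

For these samples UTF-8 needs 1, 2, 3 and 3 bytes respectively, while
UTF-16 needs 2 bytes for each.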

>
> When UTF-8 in URIs is transported or displayed with %-encoding, an
> additional factor of 3 is needed, but that's the same for all the languages
> mentioned above. This is indeed quite inconvenient when displayed, but the
> main inconvenience is not the length, but the fact that it cannot be read.
> Displaying the actual characters, and using UTF-8 directly in transport
> (i.e. essentially using IRIs) solves this problem.

A UTF-16 option would be nice. Let's be honest, UTF-8 is
English-centric. It may be needed to interoperate with existing
ASCII-based systems. But going forward, UTF-8 should not be favored
just because it is the best option for the English language.
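
As a rough illustration of the length difference under %-encoding, here
is a small Python sketch. Note that percent-encoding UTF-16 bytes is
purely hypothetical here; it is not something the URI specs define, and
the helper percent_encode is my own invention for the comparison.

```python
from urllib.parse import quote


def percent_encode(text, encoding):
    """Percent-encode every byte of text in the given encoding.

    Hypothetical helper: it escapes all bytes, including ones that
    would be unreserved ASCII, just to compare lengths.
    """
    return "".join(f"%{b:02X}" for b in text.encode(encoding))


word = "中文"

# Standard practice: encode as UTF-8, then %-encode each byte.
utf8_form = quote(word, safe="")

# Hypothetical alternative: %-encode the UTF-16 (big-endian) bytes.
utf16_form = percent_encode(word, "utf-16-be")

print(utf8_form, len(utf8_form))
print(utf16_form, len(utf16_form))
```

For this two-character word the UTF-8 form is 18 characters long and the
hypothetical UTF-16 form is 12. Either way it is unreadable, which is
the point made in the quoted text above.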

Zhong Yu

Received on Friday, 17 January 2014 12:53:46 UTC