Re: Simplified or traditional for each Chinese macrolanguage from Ambrose LI on 2016-07-27 (public-i18n-cjk@w3.org from July to September 2016)

From: Ambrose LI <ambrose.li@gmail.com>
Date: Wed, 27 Jul 2016 00:09:59 -0400
To: Xidorn Quan <me@upsuper.org>
Cc: Koji Ishii <kojiishi@gmail.com>, John Cowan <cowan@mercury.ccil.org>, 董福興 <bobbytung@wanderer.tw>, CJK discussion <public-i18n-cjk@w3.org>, Makoto Kato <m_kato@ga2.so-net.ne.jp>, 劉慶 <ryukeikun@gmail.com>
Message-ID: <CADJvFOWt3_2akSSzGR16c5BeCfgqfGPZWQ-1++sZqMs9O=KEbg@mail.gmail.com>

2016-07-26 23:34 GMT-04:00 Xidorn Quan <me@upsuper.org>:
> On Wed, Jul 27, 2016, at 12:13 PM, Koji Ishii wrote:
>
>> So Literary Chinese and Mandarin are hard to determine? I checked Windows
>> region/locale/language settings but it doesn't seem to have these in the
>> list.
>
> Mandarin is just what we generally refer to by saying "Chinese". Literary is
> a historic language of Chinese, which is not used in daily life nowadays.

>> Maybe we should handle them as "unknown", so that browsers fall back to using
>> the system setting?
>
> What should zh (without anything else) do, actually? What happens to that
> should probably be what we do for Mandarin and Literary.

zh defaults to simplified, right? I assume zh-cmn isn't really that
common? Maybe we need to tell people to just skip tagging text as
Mandarin and use zh-cmn-hant and zh-cmn-hans instead.

But come to think of it, I actually like the idea of treating
Mandarin as unknown, though I think the semantics shouldn't be "fall
back to the system setting", but rather something like "inherit if
possible, else fall back to the system setting".

What I have in mind is a use case to the effect of

    <html lang=zh-hant>
    [...] We call this <span lang=zh-yue>xxx</span>. However, in
    Mandarin the same thing is called <span lang=zh-cmn>yyy</span>.

(Obviously, this would also apply to other dialects. For example, in a
simplified Chinese text describing language differences, a term tagged
zh-yue is most likely also in simplified Chinese, not traditional
Chinese.)

There are obviously other possibilities, such as an entire audio
transcript tagged as "Mandarin". In this scenario there would be no
script to inherit from and we'll probably have to guess.
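
To make the "inherit if possible, else fall back" idea concrete, here
is a rough sketch in Python. It is only an illustration of the
semantics I have in mind, not how any browser actually works. The
function names and the idea of passing the lang values as a list
(innermost element first) are my own assumptions, and the script
detection is a simplification that treats any four-letter subtag after
the first as the script.

```python
import re

# Hypothetical helpers for illustration; not part of any browser or library.
FOUR_LETTERS = re.compile(r"^[A-Za-z]{4}$")

def explicit_script(tag):
    """Return the script subtag of a language tag (e.g. 'Hant'), or None."""
    for subtag in tag.split("-")[1:]:
        if len(subtag) == 1:
            # Singleton: extension or private-use subtags follow, so stop.
            break
        if FOUR_LETTERS.match(subtag):
            return subtag.title()
    return None

def resolve_script(lang_chain, system_script):
    """Pick a script for an element.

    lang_chain holds the lang values from the element outward to the
    root, innermost first. The first tag that names a script wins
    ("inherit if possible"); otherwise the system setting is used.
    """
    for tag in lang_chain:
        script = explicit_script(tag)
        if script is not None:
            return script
    return system_script

# The use case above: zh-cmn inside a zh-hant document inherits Hant.
print(resolve_script(["zh-cmn", "zh-hant"], system_script="Hans"))
# The audio transcript case: nothing to inherit, so we have to guess.
print(resolve_script(["zh-cmn"], system_script="Hans"))
```

With this reading, a bare zh-cmn or zh-yue never forces a script by
itself; it only picks one up from its context or, failing that, from
the system.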

Thoughts?

>> FYI, Wikipedia[1] already uses "lzh", without script.
>
> If we use Wikipedia as the criterion, the list would significantly change.
> Basically as far as I can see, Wikipedia uses Traditional Chinese in almost
> every Chinese language it has a version for. But I suspect that most of
> those Wikipedias are built by language enthusiasts, and not used by people in
> general, so I tend not to pick that as a criterion.

I only know the zh, lzh, and yue versions. FWIW, IMHO traditional for
lzh makes a lot of sense because mapping back from simplified to
traditional is problematic. That's true for zh as well, but I guess
that since so many people use simplified these days, accepting
simplified is unavoidable.
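
As a small illustration of why mapping back is problematic, here is a
Python sketch using a tiny hand-picked table. The table and function
names are made up for this example and cover only three well-known
cases; a real conversion table is far larger, and a real converter
would need context to choose among the candidates.

```python
# Illustrative sample only: simplified characters that correspond to
# more than one traditional character.
SIMPLIFIED_TO_TRADITIONAL = {
    "发": ["發", "髮"],        # 發 "to emit", 髮 "hair"
    "后": ["后", "後"],        # 后 "queen", 後 "after"
    "干": ["干", "乾", "幹"],  # 干 "shield", 乾 "dry", 幹 "to do"
}

def traditional_candidates(text):
    """For each character, list the traditional forms it could map to.

    Characters missing from the sample table are passed through as-is.
    """
    return [SIMPLIFIED_TO_TRADITIONAL.get(ch, [ch]) for ch in text]

def is_ambiguous(text):
    """True if any character has more than one traditional candidate."""
    return any(len(forms) > 1 for forms in traditional_candidates(text))

print(traditional_candidates("发"))
print(is_ambiguous("干"))
```

A character-by-character conversion has no way to choose among these
candidates, which is why starting from traditional text is the safer
direction.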

For yue, I mentioned that zh-yue-hant and zh-yue-hans use different
conventions. To put it another way, if you do a Unicode-based
conversion from zh-yue-hant to zh-yue-hans you essentially get
gibberish, and if you do it in the other direction you also get
gibberish. This is even worse than the usual conversion between
zh-hans and zh-hant, so I assume they just had to pick one and stick
with it.

> But on the other hand, I guess those language tags are almost only used in
> Wikipedia, and not anywhere else...

I’ve seen lzh used in software. So it (as a valid ISO language code)
certainly is being used elsewhere, though probably very rarely.

-- 
Ambrose Li // http://o.gniw.ca / http://gniw.ca
If you saw this on CE-L: You do not need my permission to quote
me, only proper attribution. Always cite your sources, even if
you have to anonymize and/or cite it as "personal communication".

Received on Wednesday, 27 July 2016 04:11:09 UTC