Re: Language Identifier List up for comments

From: Mark Davis <mark.davis@jtcsv.com> · Date: Tue, 14 Dec 2004 12:48:01 -0800

I think Richard was making a very different point. In particular, I share
his concerns as given in [4].

I don't know what this list is intended for, nor how it would be used (or
misused), nor precisely what it is supposed to measure, nor the criteria for
being on or off the list. Do the authors think that someone is supposed to
reject a language tag containing a region that is not on the list? Or that
localizations be limited to the list? Or include all of the list?

Here's the way I see it. There are a couple of different possible lists that
would be interesting information.

A. A set of language subtags for which there is no difference in written or
spoken form based on region. This, however, would be rather difficult to
determine. I suspect the only qualifying ones would be those that were
essentially limited to a single region. Even for a language like Japanese
you'd have to verify that the Japanese immigrant community in the US,
Brazil, etc. spoke and wrote identically to their counterparts in Japan.
(More useful would be the size of given sub-populations, either by head
count or by various economic measures.)

B. A set of language->region mappings that include every region where there
is a significant population base of native speakers or where the language is
an official language of that country. E.g. something like:

EN => AS AU BM BW BZ CA CM GB GH HK IE IN JM MT NG NZ PH PK PU RH SG TT UG
      UM US VG VI ZA ZW
FR => BE CA CD CH CI FR FX LU MC PF RE
...

This list would be easier to derive, although the criterion for
"significant" would present its own challenges.

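As a rough sketch of what list B could look like as data, assuming Python:
the dictionary below holds only a few of the example entries above (the
selection is arbitrary), and the names LANGUAGE_REGIONS and compose_tags are
hypothetical, chosen just for illustration.

```python
# Illustrative subset of the language -> region mapping from list B.
# The entries are copied from the examples above; this is not a complete list.
LANGUAGE_REGIONS = {
    "en": ["AU", "CA", "GB", "US"],
    "fr": ["BE", "CA", "CH", "FR"],
}


def compose_tags(mapping):
    """Return every language-region tag that can be composed from the mapping."""
    return sorted(
        f"{language}-{region}"
        for language, regions in mapping.items()
        for region in regions
    )


print(compose_tags(LANGUAGE_REGIONS))
```

The hard part is not the composition but deciding which regions belong in
the mapping in the first place.
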
As near as I can gather,
http://www.i18nguy.com/unicode/language-identifiers.html is somehow a
product of A and B. It would be the set of language tags where there was a
difference between region subtags (A). It would either have to list all the
language tags that could be composed from B, or if it didn't have all of the
regions, it would have to indicate which of those regions were identified
with the "regionless" language tag (and that decision itself is fraught with
political issues).
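
One way to picture that combination, as a sketch only: the data below is
made up for illustration, and VARIES_BY_REGION, REGIONLESS_DEFAULT and
build_tag_list are hypothetical names. The default regions in particular
are placeholders, not recommendations.

```python
# Hypothetical inputs. VARIES_BY_REGION is the set of languages that are NOT
# in list A, LANGUAGE_REGIONS is a small subset of list B, and
# REGIONLESS_DEFAULT says which region the bare language tag is identified with.
VARIES_BY_REGION = {"en", "fr"}
LANGUAGE_REGIONS = {"en": ["GB", "US"], "fr": ["CA", "FR"], "ja": ["JP"]}
REGIONLESS_DEFAULT = {"en": "US", "fr": "FR"}  # placeholder choices


def build_tag_list(varies, mapping, defaults):
    """Map each tag to the regions it stands for."""
    result = {}
    for language, regions in mapping.items():
        if language not in varies:
            # No regional difference: the bare tag covers every region.
            result[language] = list(regions)
            continue
        default = defaults.get(language)
        for region in regions:
            if region == default:
                # The "regionless" tag is identified with this region.
                result[language] = [region]
            else:
                result[f"{language}-{region}"] = [region]
    return result


print(build_tag_list(VARIES_BY_REGION, LANGUAGE_REGIONS, REGIONLESS_DEFAULT))
```

The REGIONLESS_DEFAULT table is exactly where the political decision ends
up being encoded.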

By the way, it's missing (depending on how the criteria are applied):
aa-DJ, aa-ER, aa-ET, af-ZA, am-ET, ar-IN, as-IN, az-AZ, be-BY, bg-BG,
byn-ER, ca-ES, cs-CZ, dv-MV, dz-BT, en-BE, en-HK, en-IN, en-MH, en-UM,
et-EE, eu-ES, fa-AF, fa-IR, fi-FI, fo-FO, gez-ER, gez-ET, gl-ES, gu-IN,
haw-US, he-IL, hi-IN, hy-AM, id-ID, is-IS, ja-JP, ka-GE, kk-KZ, kl-GL,
km-KH, kn-IN, kok-IN, ky-KG, lo-LA, lt-LT, lv-LV, mk-MK, ml-IN, mn-MN,
mr-IN, mt-MT, nb-NO, nn-NO, om-ET, om-KE, or-IN, pa-IN, pl-PL, ps-AF, ro-RO,
ru-RU, ru-UA, sa-IN, sh-YU, sid-ET, sk-SK, sl-SI, so-DJ, so-ET, so-KE,
so-SO, sq-AL, sr-Cyrl, sr-Cyrl-YU, sr-Latn, sr-Latn-YU, syr-SY, te-IN,
th-TH, ti-ER, ti-ET, tig-ER, tt-RU, uk-UA, uz-AF, uz-UZ, vi-VN, wal-ET,
zh-HK, zh-Hans, zh-Hans-CN, zh-Hans-SG, zh-Hant, zh-Hant-HK, zh-Hant-MO,
zh-Hant-TW, zh-MO
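
A comparison like this can be done mechanically once both lists exist as
data. The following is a minimal sketch assuming Python; the "published" set
is a made-up placeholder rather than the actual contents of the page, and
missing_tags is a hypothetical helper. Tags are compared case-insensitively,
since language tags are not case-sensitive.

```python
def missing_tags(expected, published):
    """Return the expected tags absent from published, ignoring case."""
    published_lower = {tag.lower() for tag in published}
    return sorted(tag for tag in expected if tag.lower() not in published_lower)


# A few tags from the list above, checked against a placeholder published set.
expected = {"af-ZA", "ja-JP", "zh-Hant-TW"}
published = {"en-US", "ja-jp"}  # placeholder data, not the real page contents

print(missing_tags(expected, published))
```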

Mark

----- Original Message ----- 
From: "Tex Texin" <tex@xencraft.com>
To: "Richard Ishida" <ishida@w3.org>
Cc: <www-international@w3.org>; <ietf-languages@alvestrand.no>
Sent: Tuesday, December 14, 2004 11:02
Subject: Re: Language Identifier List up for comments

> I agree, and that's why we need to provide more guidance than we have
> done to date.
>
> Richard Ishida wrote:
> >
> > Comments:
> >
> > [1] For Chinese: What about zh-Hans and zh-Hant?  What about the IANA
> > stuff like zh-hakka, etc.?
> >
> > [2] What if I just want to say "This is Turkish - but I don't know which
> > dialect"?  The list makes it seem like I *need* to choose one of the
> > country variants.
> >
> > [3] Is there a big enough difference between en-GB and, say, en-FK that
> > I should need to distinguish between the two?
> >
> > [4] I'm not clear about the value of the list.  A list like this
> > suggests to me that things can be looked up here without a great deal
> > of thought.  I'm not convinced that that is true.  And once one applies
> > a little thought about the most appropriate label to use, it is hardly
> > difficult to come up with the appropriate country code.  Perhaps there
> > would be a minimal value in helping find some of the country codes you
> > might need, but then I would organise the information slightly
> > differently.
> >
> > [5] I think the choice of language code also depends on the intended
> > usage.  That is very hard to predict, of course.  If one is simply
> > applying a different font to English text embedded in an Arabic
> > document, then I think labelling with subcodes is overkill.  If
> > labelling English text for use with a spell checker, a distinction
> > between en-US and en-GB is typically useful because spell checkers for
> > English tend to take that distinction into account - whether that
> > applies for all variants of other languages is not clear to me.  If
> > dealing with a text to speech application that can distinguish accents
> > such as en-UK-scouse, then a higher level of detail is needed than that
> > given in the table. If dealing with Accept-Language declarations, then
> > you must declare both en and en-UK/en-US in a browser, otherwise you
> > won't always get the results you expected. I think the table
> > over-simplifies the question.  I'll concede that the answer to the
> > question is very difficult to produce, but my concern is that the table
> > seems to be offering a solution, by fiat, that is not always correct,
> > and doesn't say that clearly enough.
> >
> > [6] typo: Lingala uses an upper case 'I'
> >
> > RI
> >
> > ============
> > Richard Ishida
> > W3C
> >
> > contact info:
> > http://www.w3.org/People/Ishida/
> >
> > W3C Internationalization:
> > http://www.w3.org/International/
> >
> > Publication blog:
> > http://people.w3.org/rishida/blog/
> >
> >
> >
> > > -----Original Message-----
> > > From: www-international-request@w3.org
> > > [mailto:www-international-request@w3.org] On Behalf Of Tex Texin
> > > Sent: 14 December 2004 10:43
> > > To: www-international@w3.org
> > > Cc: www-international@w3.org; ietf-languages@alvestrand.no
> > > Subject: Language Identifier List up for comments
> > >
> > >
> > > http://www.i18nguy.com/unicode/language-identifiers.html
> > >
> > > I will add caveats and expand the list to be both one level
> > > and two level as we go along.
> > >
> > > I am in a busy patch, so comment now, but I won't make many
> > > updates until the weekend.
> > >
> > > tex
> > >
> > >
>
> -- 
> -------------------------------------------------------------
> Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
> Xen Master                          http://www.i18nGuy.com
>
> XenCraft             http://www.XenCraft.com
> Making e-Business Work Around the World
> -------------------------------------------------------------
>