Re: Language Identifier List Comments, updated from John Cowan on 2004-12-27 (www-international@w3.org from October to December 2004)

From: John Cowan <jcowan@reutershealth.com>
Date: Mon, 27 Dec 2004 13:07:07 -0500
To: Andrew Cunningham <andj_c@iprimus.com.au>
Cc: Tex Texin <tex@xencraft.com>, WWW International <www-international@w3.org>, IETF Languages <ietf-languages@iana.org>
Message-ID: <20041227180707.GK2927@skunk.reutershealth.com>

Andrew Cunningham scripsit:

> ar-SD (Arabic)  also this tag could be considered to be ambiguous .. is 
> it the national language of Sudan (Standard Modern Arabic) or is it the 
> Sudanese Arabic dialect?

At present the ar language tag is inherently ambiguous: ISO 639-1 maps
it to the name "Arabic", without any clarification of what language
or languages "Arabic" might refer to.  The editor's draft of ISO 639-3
clarifies the mapping of "ar" to refer to both Modern Standard Arabic
and the colloquials, and tentatively assigns the code "arb" to MSA and
different codes to the 29 recognized colloquials.

In the language of the draft, codes like "ar" refer to what are called
"macrolanguages", explained as follows:

# In various parts of the world, there are clusters of closely-related
# language varieties that, based on the criteria discussed in 4.2.1,
# can be considered individual languages, yet in certain usage contexts a
# single language identity for all is needed. Typical situations in which
# this need can occur include the following:
# 
#  There is one variety that is more developed and that tends
#  to be used for wider communication by speakers of various
#  closely-related languages; as a result, there is a perceived
#  common linguistic identity across these languages. For instance,
#  there are several distinct spoken Arabic languages, but Standard
#  Arabic is generally used in business and media across all of
#  these communities, and is also an important aspect of a shared
#  ethno-religious unity. As a result, a perceived common linguistic
#  identity exists.
# 
#  There is a common written form used for multiple closely-related
#  languages. For instance, multiple Chinese languages share a
#  common written form.
# 
#  There is a transitional socio-linguistic situation in which
#  sub-communities of a single language community are diverging,
#  creating a need for some purposes to recognise distinct
#  languages while, for other purposes, a single common identity
#  is still valid. For instance, in some business contexts it is
#  necessary to make a distinction between Bosnian, Croatian and
#  Serbian languages, yet there are other contexts in which these
#  distinctions are not discernable in language resources that are
#  in use.
# 
# Where such situations exist, an identifier for the single, common language
# identity is considered in this part of ISO 639 to be a macrolanguage
# identifier.  Macrolanguages are distinguished from language collections
# in that the individual languages that correspond to a macrolanguage must
# be very closely related, and there must be some domain in which only a
# single language identity is recognized.

The draft specifies the following 56 macrolanguages:

ak   Akan (2 languages)
ar   Arabic (30 languages)
ay   Aymara (2 languages)
az   Azerbaijani (2 languages)
bal  Baluchi (3 languages)
bik  Bikol (5 languages)
bua  Buriat (3 languages)
chm  Mari (2 languages)
cr   Cree (6 languages)
del  Delaware (2 languages)
den  Slave (2 languages)
din  Dinka (5 languages)
doi  Dogri (2 languages)
fa   Persian (2 languages)
ff   Fulah (9 languages)
fy   Frisian (3 languages)
gba  Gbaya (5 languages)
gn   Guarani (5 languages)
gon  Gondi (2 languages)
grb  Grebo (5 languages)
hai  Haida (2 languages)
hbs  Serbo-Croatian (3 languages)
hmn  Hmong (21 languages)
ik   Inupiaq (2 languages)
iu   Inuktitut (2 languages)
jrb  Judeo-Arabic (5 languages)
kg   Kongo (3 languages)
kok  Konkani (2 languages)
kpe  Kpelle (2 languages)
kr   Kanuri (3 languages)
ku   Kurdish (3 languages)
kv   Komi (2 languages)
lah  Lahnda (8 languages)
man  Mandingo (7 languages)
mg   Malagasy (10 languages)
mn   Mongolian (2 languages)
ms   Malay (13 languages)
mwr  Marwari (7 languages)
no   Norwegian (2 languages)
oc   Occitan; Provençal (5 languages)
oj   Ojibwa (7 languages)
om   Oromo (4 languages)
ps   Pushto (3 languages)
qu   Quechua (44 languages)
raj  Rajasthani (6 languages)
rom  Romany (7 languages)
sc   Sardinian (4 languages)
sq   Albanian (4 languages)
sw   Swahili (2 languages)
syr  Syriac (2 languages)
tmh  Tamashek (4 languages)
uz   Uzbek (2 languages)
yi   Yiddish (2 languages)
za   Zhuang (2 languages)
zap  Zapotec (58 languages)
zh   Chinese (13 languages)
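
For anyone who wants to check tags such as "ar-SD" against this list
mechanically, here is a minimal Python sketch. It assumes the rows keep
the "code  Name (N languages)" layout used above; the names SAMPLE, ROW,
parse_macrolanguages and is_macrolanguage_tag are invented for this
illustration and come from neither the draft nor any library. Only four
rows are copied in to keep it short.

```python
import re

# A few rows copied from the list above, in the same layout.
# (Illustrative subset only; paste the full list to cover all 56 entries.)
SAMPLE = """\
ar   Arabic (30 languages)
hbs  Serbo-Croatian (3 languages)
no   Norwegian (2 languages)
zh   Chinese (13 languages)
"""

# Assumed row shape: 2- or 3-letter code, name, "(N languages)".
ROW = re.compile(
    r"^(?P<code>[a-z]{2,3})\s+(?P<name>.+?)\s+\((?P<count>\d+) languages\)$"
)


def parse_macrolanguages(text):
    """Return {code: (name, member_count)} for each well-formed row."""
    table = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            table[match["code"]] = (match["name"], int(match["count"]))
    return table


def is_macrolanguage_tag(tag, table):
    """True if the primary language subtag of `tag` is a listed macrolanguage."""
    primary = re.split(r"[-_]", tag, maxsplit=1)[0].lower()
    return primary in table


MACROLANGUAGES = parse_macrolanguages(SAMPLE)
```

With this subset, is_macrolanguage_tag("ar-SD", MACROLANGUAGES) is true,
which is the ambiguity Andrew raised: the tag alone does not say whether
MSA or the Sudanese colloquial is meant. A tag such as "en-AU" is not
flagged, since "en" is not in the list.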

-- 
Even a refrigerator can conform to the XML      John Cowan
Infoset, as long as it has a door sticker       jcowan@reutershealth.com
saying "No information items inside".           http://www.reutershealth.com
        --Eve Maler                             http://www.ccil.org/~cowan
Received on Monday, 27 December 2004 18:07:52 UTC