Re: Upcoming changes to BCP47 (language tag) syntax

Hi Richard,

Some notes follow.
> 
> 
> It would be good to have the bulk of the text below available in some publicly accessible form so that we can point to it.

Well, of course, my email is now accessible :-). However, I intend to 
write a new bit of guidance once things settle down (a la the ML 
article you adapted previously), since this bears documenting widely, 
loudly, early and often [lest there be confusion].

> 
> Also, I'm wondering what the implications are wrt use of zh and the cmn, yue, etc. subtags.  Are we expecting zh to be used only for things like zh-Hans and zh-Hant?  Can we still use it in a vague way to mean 'some kind of Chinese'?  Will the zh-cmn etc. tags be deprecated? etc.

There is guidance in draft-4646bis on this topic (at [1]). The relevant 
paragraphs currently read:

---
In general, use the most specific subtag to form the language tag. 
However, where the macrolanguage tag has been historically used to 
denote a dominant encompassed language, it SHOULD be used in place of 
the subtag specific to that encompassed language unless it is necessary 
to clearly distinguish the macrolanguage as a whole from that enclosed 
dominant language variety.

In particular, the Chinese family of languages call for special 
consideration. Because the written form is very similar for most 
languages having 'zh' (Chinese) as a macrolanguage (and because 
historically subtags for the various encompassed languages were not 
available), languages such as 'yue' (Cantonese) have historically used 
either 'zh' or a tag (now grandfathered) beginning with 'zh'. This means 
that macrolanguage information can be usefully applied when searching 
for content or when providing fallbacks in language negotiation. For 
example, the information that 'yue' has a macrolanguage of 'zh' could be 
used in the Lookup algorithm to fall back from a request for 
"yue-Hans-CN" to "zh-Hans-CN" without losing the script and region 
information (even though the user did not specify "zh-Hans-CN" in their 
request).
---
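
To make the fallback in that last paragraph concrete, here is a rough sketch in Python. It is only an illustration, not text from the draft: the function name, the tiny MACROLANGUAGE table, and the choice to try the macrolanguage only after ordinary truncation has failed are all my assumptions. A real implementation would take the mappings from the subtag registry.

```python
# Hypothetical, partial table (an assumption for this sketch); real
# macrolanguage mappings would come from the language subtag registry.
MACROLANGUAGE = {"yue": "zh", "cmn": "zh"}


def lookup(language_range, available, default=None):
    """Lookup-style matching with a macrolanguage retry.

    First truncate the requested range from the right until something
    matches. If nothing does, swap the primary language subtag for its
    macrolanguage and try again, keeping script and region.
    """
    by_lower = {tag.lower(): tag for tag in available}

    def truncate_search(subtags):
        subtags = list(subtags)
        while subtags:
            candidate = "-".join(subtags).lower()
            if candidate in by_lower:
                return by_lower[candidate]
            subtags.pop()
            # Also drop a single-character subtag left dangling at the end.
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()
        return None

    subtags = language_range.split("-")
    match = truncate_search(subtags)
    if match is None:
        macro = MACROLANGUAGE.get(subtags[0].lower())
        if macro is not None:
            match = truncate_search([macro] + subtags[1:])
    return match if match is not None else default


print(lookup("yue-Hans-CN", ["en", "zh-Hans-CN", "zh-Hant-TW"]))
# prints: zh-Hans-CN
```

Note the ordering trade-off: because truncation runs first, a request for "yue-Hant-HK" against content tagged "yue" and "zh-Hant-HK" returns "yue". Trying the macrolanguage at each truncation step instead would be an equally defensible design.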

Basically, the idea is:

- Use 'zh' when you mean "Chinese" undifferentiated by a regional 
language. Since Mandarin Chinese is (by most measures) the "dominant 
encompassed language" [because most written Chinese documents use a 
standardized form which is essentially Mandarin Chinese], this means 
that 'zh' should be used in preference to 'cmn' (or other subtags) for 
most Chinese content. This is especially true for most Web document 
formats (such as HTML, XML+*, and so on). Practically speaking, this 
means that you won't need to retag most content.

- Use 'yue' (and related tags, such as 'yue-Hant' or 'yue-HK') when you 
need to indicate "Cantonese (Yue) Chinese" as distinct from Chinese. 
Typically this will be audio (or video) content, since most written 
documents actually use a Mandarin form. Of course, written Cantonese 
(while relatively rare) also exists and should be tagged with the 'yue' 
subtag. Note that, with the exception of 'cmn', this applies to all of 
the other encompassed language subtags.

- Use 'cmn' when you need to indicate "Mandarin Chinese" as distinct 
from Chinese in general. This case is rare and should be avoided 
whenever possible. A better way to write that would be: "Do NOT use 
'cmn' unless you have a Very Good Reason."

- Replace existing grandfathered tags such as 'zh-yue' with their subtag 
equivalents (for 'zh-yue', that is 'yue').

- If you speak a Chinese language other than Mandarin, add that language 
into your "language priority list" (e.g. HTTP Accept-Language in your 
browser) *before* you include the 'zh-*' range, and include both. For 
example, if you are a Cantonese speaker from Hong Kong, you might use a 
list like:

   Accept-Language: yue-Hant-HK;q=1.0,zh-Hant-HK;q=0.8
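
For anyone consuming such a header on the server side, the following Python sketch shows one way to turn the value into an ordered language priority list. The function name is made up for illustration, it expects the header value without the "Accept-Language:" prefix, and it is deliberately lenient about malformed q-values (treating them as 0), which is my assumption rather than anything a spec requires.

```python
def parse_accept_language(header_value):
    """Return language ranges ordered by descending q-value.

    Ranges with equal q keep their original order; ranges with q=0
    (meaning "not acceptable") are dropped.
    """
    entries = []
    for position, item in enumerate(header_value.split(",")):
        item = item.strip()
        if not item:
            continue
        language_range, _, params = item.partition(";")
        quality = 1.0  # default weight when no q parameter is given
        for param in params.split(";"):
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # assumption: unparsable weight is ignored
        entries.append((-quality, position, language_range.strip()))
    return [rng for neg_q, _, rng in sorted(entries) if neg_q < 0]


print(parse_accept_language("yue-Hant-HK;q=1.0,zh-Hant-HK;q=0.8"))
# prints: ['yue-Hant-HK', 'zh-Hant-HK']
```

With the list above, content tagged 'yue-Hant-HK' is preferred, and 'zh-Hant-HK' content is still acceptable when no Cantonese version exists.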

It is important to note that, despite the foregoing, the subtag 'zh' 
does NOT mean strictly Mandarin. It just happens to be the case that 
most 'zh' content turns out to be Mandarin.

Hope that helps. I look forward to writing some guidelines on this topic 
soon.

Best,

Addison

[1] http://www.inter-locale.com/ID/draft-ietf-ltru-4646bis-11.html#choice

> 
> Cheers,
> RI
> 
> ============
> Richard Ishida
> Internationalization Lead
> W3C (World Wide Web Consortium)
>  
> http://www.w3.org/International/
> http://rishida.net/blog/
> http://rishida.net/
> 
>  
> 
>> -----Original Message-----
>> From: member-i18n-core-request@w3.org [mailto:member-i18n-core-
>> request@w3.org] On Behalf Of Addison Phillips
>> Sent: 17 January 2008 05:19
>> To: I18N
>> Cc: member-i18n-core@w3.org
>> Subject: Upcoming changes to BCP47 (language tag) syntax
>>
>>
>> In this week's Internationalization Core WG teleconference, I drew an
>> action item [1] to provide more information about a proposed change to
>> the language tag ABNF (the grammar or formal syntax) in the proposed
>> successor to RFC 4646. That's because the W3C created several documents
>> [2] and [3] at about the time RFC 4646 came into being describing
>> language tags. Parts of these documents speculate about a potential
>> future feature of language tags that is now being removed or will not be
>> used. The I18N Core WG is now preparing to revise these documents to keep
>> them current, and, as co-editor of the proposed replacement, I've been
>> following the details closely.
>>
>> As many of you know, RFC 4646 was created as a successor to RFC 3066 as
>> the document defining "BCP 47", the language tagging standard for
>> Internet (and other) technologies. You may know "BCP 47" as "xml:lang"
>> or as the values in the HTTP Accept-Language header, for example.
>>
>> RFC 4646 provided a more complex syntax that defined several new
>> "flavors" of subtag in addition to the language and region subtags that
>> had been formally defined previously. Most of these new types were fully
>> defined in 4646. However, one type of subtag was reserved for future
>> use: the "extended language" subtags, or, colloquially, "extlangs".
>>
>> Extended language subtags were intended to accommodate a feature of ISO
>> 639-3, whereby some languages were considered to be encompassed by
>> existing languages, which were called "macro-languages". For example,
>> Mandarin Chinese and Cantonese are both distinct languages that have
>> their own codes in ISO 639-3 (these are 'cmn' and 'yue' respectively).
>> Both of these languages (with several others) are encompassed by the
>> Macrolanguage called "Chinese", which is represented by the code 'zh' in
>> language tags.
>>
>> At the time 4646 was created, the IETF working group theorized that
>> language tags for these languages would use both the macro- and
>> encompassed language codes together. For example, a Cantonese (yue)
>> document written in the Traditional script (Hant) for Hong Kong (HK)
>> would use a tag like "zh-yue-Hant-HK".
>>
>> However, after a great deal of debate and consideration, it was decided
>> that this extlang feature would NOT be used. The encompassed and
>> macrolanguage codes would both appear as potential primary language
>> subtags and the extended language subtag would not be used. Thus, for
>> example, the document described above would use the tag "yue-Hant-HK".
>>
>> It should be noted that the IETF working group for language tags has
>> also decided to remove the extlang production from the language tag
>> syntax. This production was explicitly reserved for future use and no
>> tags have ever been valid that used it. A few tags were registered
>> during the RFC 3066 era that appear to use these subtags, but these were
>> separately handled by the "grandfathered" productions in the grammar.
>>
>> Removing extlang altogether will simplify writing language tag
>> processors and relax some of the minimum length requirements previously
>> imposed.
>>
>> Finally, this move was not taken without considerable debate and
>> discussion. Some of the macrolanguages are obscure, but Chinese and
>> Arabic languages are among those affected. Those interested in the
>> macrolanguage mapping list can refer to the ISO 639-3 RA's page showing
>> the current mappings [4].
>>
>> The proposed successor is now nearing completion. A link to the current
>> draft of the document can be found on my page [5], along with links to
>> the IETF LTRU WG responsible for this document, the mail archive, and so
>> forth.
>>
>> Best Regards,
>>
>> Addison
>>
>> [1] http://www.w3.org/2008/01/16-core-minutes.html#action04
>> [2] http://www.w3.org/International/articles/bcp47/
>> [3] http://www.w3.org/International/articles/language-tags/#iana
>> [4] http://www.sil.org/iso639%2D3/macrolanguages.asp
>> [5] http://www.inter-locale.com
>>
>> --
>> Addison Phillips
>> Globalization Architect -- Yahoo! Inc.
>> Chair -- W3C Internationalization Core WG
>>
>> Internationalization is an architecture.
>> It is not a feature.
> 


-- 
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Chair -- W3C Internationalization Core WG

Internationalization is an architecture.
It is not a feature.

Received on Monday, 21 January 2008 22:13:19 UTC