Upcoming changes to BCP47 (language tag) syntax

In this week's Internationalization Core WG teleconference, I drew an
action item [1] to provide more information about a proposed change to
the language tag ABNF (the grammar or formal syntax) in the proposed
successor to RFC 4646. The action arose because the W3C published
several documents ([2] and [3]) describing language tags at about the
time RFC 4646 came into being. Parts of these documents speculate about
a potential future feature of language tags that is now being removed
and will not be used. The I18N Core WG is now preparing to revise these
documents to keep them current, and, as co-editor of the proposed
successor to RFC 4646, I've been following the details closely.

As many of you know, RFC 4646 was created as a successor to RFC 3066 as
the document defining "BCP 47", the language tagging standard for
Internet (and other) technologies. You may know BCP 47 language tags as
the values of the "xml:lang" attribute or of the HTTP Accept-Language
header, for example.

RFC 4646 provided a more complex syntax that defined several new 
"flavors" of subtag in addition to the language and region subtags that 
had been formally defined previously. Most of these new types were fully 
defined in 4646. However, one type of subtag was reserved for future 
use: the "extended language" subtags, or, colloquially, "extlangs".

Extended language subtags were intended to accommodate a feature of ISO
639-3, whereby some languages are considered to be encompassed by
existing languages, which are called "macrolanguages". For example,
Mandarin Chinese and Cantonese are distinct languages that have their
own codes in ISO 639-3 (these are 'cmn' and 'yue' respectively). Both of
these languages (with several others) are encompassed by the
macrolanguage called "Chinese", which is represented by the code 'zh' in
language tags.

At the time 4646 was created, the IETF working group theorized that 
language tags for these languages would use both the macro- and 
encompassed language codes together. For example, a Cantonese (yue) 
document written in the Traditional script (Hant) for Hong Kong (HK) 
would use a tag like "zh-yue-Hant-HK".

However, after a great deal of debate and consideration, it was decided 
that this extlang feature would NOT be used. The encompassed and 
macrolanguage codes would both appear as potential primary language 
subtags and the extended language subtag would not be used. Thus, for 
example, the document described above would use the tag "yue-Hant-HK".
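
As a rough illustration of the difference (my own sketch, not anything
taken from the draft), the Python below rewrites a tag of the once
theorized form into the form now proposed. The function name and the
two-entry table are hypothetical and exist only for this example; a real
processor would consult the registry rather than a hand-written table.

```python
# Illustrative only: a tiny hand-written table, NOT the real registry.
# Maps an encompassed language code to its macrolanguage code,
# using the two examples given in this message.
MACROLANGUAGE_OF = {
    "cmn": "zh",  # Mandarin Chinese
    "yue": "zh",  # Cantonese
}


def drop_macrolanguage_prefix(tag: str) -> str:
    """Rewrite tags like 'zh-yue-Hant-HK' as 'yue-Hant-HK'.

    Tags that do not begin with a known macrolanguage followed by one
    of its encompassed languages are returned unchanged.
    """
    subtags = tag.split("-")
    if len(subtags) >= 2:
        macro = subtags[0].lower()
        candidate = subtags[1].lower()
        if MACROLANGUAGE_OF.get(candidate) == macro:
            # Promote the encompassed language to primary language subtag.
            return "-".join([candidate] + subtags[2:])
    return tag


print(drop_macrolanguage_prefix("zh-yue-Hant-HK"))  # yue-Hant-HK
print(drop_macrolanguage_prefix("zh-Hant-HK"))      # zh-Hant-HK
```

Note that a plain "zh" tag is left alone: the macrolanguage code remains
usable as a primary language subtag in its own right.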

It should be noted that the IETF working group for language tags has 
also decided to remove the extlang production from the language tag 
syntax. This production was explicitly reserved for future use and no 
tags have ever been valid that used it. A few tags were registered 
during the RFC 3066 era that appear to use these subtags, but these were 
separately handled by the "grandfathered" productions in the grammar.

Removing extlang altogether will simplify writing language tag
processors and relax some of the minimum length requirements previously
imposed.
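
To give a feel for the simpler shape, here is a deliberately reduced
Python sketch that accepts only language[-script][-region] tags, with
nothing between the primary language and the script. It is an
assumption-laden toy, not the ABNF from the draft: the pattern and the
function name are mine, and variants, extensions, private use, and
grandfathered tags are all ignored.

```python
import re

# Simplified pattern: primary language, then optional script and region.
# Deliberately omits variants, extensions, private use and grandfathered
# tags, so it is NOT a conformant language tag parser.
_SIMPLE_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"             # e.g. "zh", "yue"
    r"(?:-(?P<script>[A-Za-z]{4}))?"           # e.g. "Hant"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"  # e.g. "HK"
)


def parse_simple_tag(tag):
    """Return a dict of subtags, or None if the tag does not fit the
    simplified language[-script][-region] shape."""
    match = _SIMPLE_TAG.fullmatch(tag)
    if match is None:
        return None
    return match.groupdict()


print(parse_simple_tag("yue-Hant-HK"))
print(parse_simple_tag("zh-yue-Hant-HK"))
```

With no slot for an extended language subtag, "yue-Hant-HK" parses
directly, while "zh-yue-Hant-HK" is rejected by this toy pattern.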

Finally, this move was not taken without considerable debate. Some of
the macrolanguages are obscure, but Chinese and Arabic languages are
among those affected. Those interested in the macrolanguage mapping list
can refer to the ISO 639-3 RA's page showing the current mappings [4].

The proposed successor is now nearing completion. A link to the current 
draft of the document can be found on my page [5], along with links to 
the IETF LTRU WG responsible for this document, the mail archive, and so 
forth.

Best Regards,

Addison

[1] http://www.w3.org/2008/01/16-core-minutes.html#action04
[2] http://www.w3.org/International/articles/bcp47/
[3] http://www.w3.org/International/articles/language-tags/#iana
[4] http://www.sil.org/iso639%2D3/macrolanguages.asp
[5] http://www.inter-locale.com

-- 
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Chair -- W3C Internationalization Core WG

Internationalization is an architecture.
It is not a feature.

Received on Thursday, 17 January 2008 05:18:58 UTC