
RE: Language Identifier List up for comments

From: Peter Constable <petercon@microsoft.com>
Date: Wed, 15 Dec 2004 07:13:17 -0800
Message-ID: <F8ACB1B494D9734783AAB114D0CE68FE048571CB@RED-MSG-52.redmond.corp.microsoft.com>
To: <www-international@w3.org>

> From: www-international-request@w3.org [mailto:www-international-
> request@w3.org] On Behalf Of Elizabeth J. Pyatt


> I concede that the encoding tag is not enough to specify the script,
> but I would consider script to be a third meta tag (i.e. ISO 15924,
> http://www.unicode.org/iso15924/iso15924-codes.html).

ISO 639 identifiers alone are not enough to capture script differences,
but W3C protocols use RFC 3066, which allows ISO 639 IDs to be combined
with other identifiers. There is precedent for incorporating ISO 15924
script IDs in registered language tags (e.g. zh-Hans). Also, a revision
of RFC 3066 is underway, currently at last call, that fully incorporates
ISO 15924 into the spec, so that language-script combinations no longer
have to be one-off registrations, just as language-country combinations
already do not.

The draft for this revision can be obtained at
http://www.ietf.org/internet-drafts/draft-phillips-langtags-08.txt. 
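To make the combination of identifiers concrete, here is a rough sketch in Python of how a tag such as zh-Hans or zh-TW might be split into its parts. The function name and the length-based rule (four letters read as an ISO 15924 script ID, two letters as a country code) are simplifying assumptions for illustration only; they are not the full grammar of RFC 3066 or of the draft.

```python
def classify_subtags(tag):
    """Split a language tag on '-' and guess the role of each subtag.

    Assumption (illustrative only): the first subtag is the ISO 639
    language ID, a 4-letter subtag is an ISO 15924 script ID, and a
    2-letter subtag is a country code. Anything else is labeled "other".
    """
    parts = tag.split("-")
    result = [("language", parts[0].lower())]
    for sub in parts[1:]:
        if len(sub) == 4 and sub.isalpha():
            # Script IDs are conventionally written in title case, e.g. "Hans".
            result.append(("script", sub.title()))
        elif len(sub) == 2 and sub.isalpha():
            # Country codes are conventionally written in upper case, e.g. "TW".
            result.append(("region", sub.upper()))
        else:
            result.append(("other", sub.lower()))
    return result


if __name__ == "__main__":
    for example in ("zh-Hans", "zh-TW", "zh-hans-cn"):
        print(example, classify_subtags(example))
```

A registered tag such as zh-guoyu falls into the "other" bucket here, which is the point: without a rule in the spec, such a combination has to be registered as a whole.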


> I see that using Chinese-TW is NOT recommended, and I am glad to see
> that. I also see why "zh" would not be helpful in and of itself as it is
> currently defined. I was assuming a definition of "zh" as the written
> form used in Chinese dialect communities, but that does not appear to
> be the correct definition. It would not be Mandarin Chinese because
> it can be read all over the country by speakers of the different
> dialects.

The language ID "zh" is defined in ISO 639 to mean "Chinese" (see
http://www.loc.gov/standards/iso639-2/langcodes.html#uvwxyz). This ID
was first used by terminologists in situations in which it most likely
meant Mandarin, but as you note there's a measure of language-neutrality
to this written form. It was adopted for use in software (e.g. POSIX
locale IDs zh_CN, zh_TW), and then on the Internet and Web (via RFC
1766, later replaced by RFC 3066). In all of these contexts, it could be
seen as either Mandarin or generic-Chinese depending on one's
perspective.

An interesting development arose in late 1999, however: the following
language tags for use in RFC 3066 applications were registered for
various Chinese languages:

zh-gan       Kan or Gan
zh-guoyu     Mandarin or Standard Chinese
zh-hakka     Hakka
zh-min       Min, Fuzhou, Hokkien, Amoy or Taiwanese
zh-wuu       Shanghaiese or Wu
zh-xiang     Xiang or Hunanese
zh-yue       Cantonese

That established two interesting precedents, IMO:

1. that "zh" had the generic-Chinese meaning

2. that RFC 3066 language tags could take the form X-Y where X is an ISO
639 ID denoting a somewhat-generic language variety and Y is a qualifier
identifying a specific language variety encompassed by X.
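The X-Y pattern can be sketched in a few lines of Python. The table holds only the seven registered tags listed above; the dictionary and function names are hypothetical, and the lookup is a simple illustration rather than how any particular RFC 3066 implementation works.

```python
# The seven registered tags listed in this message, with their descriptions.
REGISTERED_CHINESE_TAGS = {
    "zh-gan": "Kan or Gan",
    "zh-guoyu": "Mandarin or Standard Chinese",
    "zh-hakka": "Hakka",
    "zh-min": "Min, Fuzhou, Hokkien, Amoy or Taiwanese",
    "zh-wuu": "Shanghaiese or Wu",
    "zh-xiang": "Xiang or Hunanese",
    "zh-yue": "Cantonese",
}


def split_generic_and_specific(tag):
    """Return (X, Y, description) for a registered X-Y tag, else None.

    X is the generic ISO 639 ID ("zh"); Y is the qualifier naming the
    specific language variety encompassed by X.
    """
    key = tag.lower()
    if key not in REGISTERED_CHINESE_TAGS:
        return None
    generic, specific = key.split("-", 1)
    return generic, specific, REGISTERED_CHINESE_TAGS[key]


if __name__ == "__main__":
    print(split_generic_and_specific("zh-yue"))
    print(split_generic_and_specific("zh-TW"))  # not one of the seven tags
```

Note that zh-TW returns None: its second subtag is a country code, not a qualifier naming a language variety, so it does not follow the X-Y pattern described in point 2.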


A new part of the ISO 639 series of standards is in preparation (ISO
639-3, about to be circulated for DIS ballot) that provides a much
larger set of IDs, with far more complete coverage than is currently
available. This will provide three-letter IDs for each of the Chinese
languages. Once published, it is expected that it will be incorporated
into a revision of RFC 3066. At that point, there will be identifiers
available for each individual Chinese language; e.g. "gan". It's
possible the revision to RFC 3066 that incorporates ISO 639-3 would give
users the option to use either "gan" or "zh-gan".
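If both forms do end up being allowed, software comparing tags would need to treat them as the same language. A minimal sketch in Python, assuming such an equivalence exists: the table and function names are hypothetical, and the table contains only the one pair mentioned above, since ISO 639-3 is not yet published.

```python
# Hypothetical equivalence table: registered tag -> possible ISO 639-3 ID.
# Only the pair named in this message is included; nothing else is assumed.
EQUIVALENT_IDS = {
    "zh-gan": "gan",
}


def canonical_id(tag):
    """Map a tag to its shorter equivalent if one is known, else lowercase it."""
    key = tag.lower()
    return EQUIVALENT_IDS.get(key, key)


def same_language(tag_a, tag_b):
    """Compare two tags after mapping each to its canonical form."""
    return canonical_id(tag_a) == canonical_id(tag_b)


if __name__ == "__main__":
    print(same_language("gan", "zh-gan"))
    print(same_language("zh-gan", "zh-yue"))
```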



Peter Constable
Received on Wednesday, 15 December 2004 15:13:50 GMT
