- From: Peter Constable <petercon@microsoft.com>
- Date: Wed, 15 Dec 2004 07:13:17 -0800
- To: <www-international@w3.org>
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org] On Behalf Of Elizabeth J. Pyatt

> I concede that the encoding tag is not enough to specify the script,
> but I would consider script to be a third meta tag. (i.e. ISO-15924
> - http://www.unicode.org/iso15924/iso15924-codes.html)

ISO 639 identifiers are not enough to capture script differences, but
in W3 protocols what is used is RFC 3066, which allows for ISO 639 IDs
to be combined with other identifiers. There is precedent for
incorporating ISO 15924 script IDs (e.g. zh-Hans) in registered
language tags. Also, a revision of RFC 3066 is underway, and at last
call, that fully incorporates ISO 15924 into the spec, so that
language-script combinations do not have to be one-off registrations,
as is already the case for language-country combinations. The draft for
this revision can be obtained at
http://www.ietf.org/internet-drafts/draft-phillips-langtags-08.txt.

> I see that using Chinese-TW is NOT recommended, and I am glad to see
> that. I also see why "zh" would not be helpful in of itself as it is
> currently defined. I was assuming a definition of "zh" as the written
> form used in Chinese dialect communities, but that does not appear to
> be the correct definition. It would not be Mandarin Chinese because
> it can be read all over the country by speakers of the different
> dialects.

The language ID "zh" is defined in ISO 639 to mean "Chinese" (see
http://www.loc.gov/standards/iso639-2/langcodes.html#uvwxyz). This ID
was first used by terminologists in situations in which it most likely
meant Mandarin, but as you note there's a measure of
language-neutrality to this written form. It was adopted for use in
software (e.g. POSIX locale IDs zh_CN, zh_TW), and then on the Internet
and Web (via RFC 1766, later replaced by RFC 3066). In all of these
contexts, it could be seen as either Mandarin or generic-Chinese
depending on one's perspective.
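To make the tag structure concrete, here is a minimal Python sketch of how the second subtag of tags such as zh-Hans or zh-TW might be classified. It is an illustration only, not taken from RFC 3066 or the draft: the function name describe_tag and the length-based heuristic are assumptions. Two letters are read as an ISO 3166 country code, four letters as an ISO 15924 script ID (as the draft revision proposes), and anything else as some other qualifier.

```python
def describe_tag(tag):
    """Split a language tag and guess the role of its second subtag.

    Hypothetical helper for illustration; real tags can carry more
    subtags and need fuller validation than this.
    """
    subtags = tag.split("-")
    primary = subtags[0].lower()  # ISO 639 language ID, e.g. "zh"
    if len(subtags) == 1:
        return (primary, None, None)

    second = subtags[1]
    if len(second) == 2 and second.isalpha():
        role = "country"  # assumed: ISO 3166 code, e.g. "TW"
    elif len(second) == 4 and second.isalpha():
        role = "script"   # assumed: ISO 15924 code, e.g. "Hans"
    else:
        role = "other"    # any other registered qualifier
    return (primary, role, second)


for example in ("zh", "zh-TW", "zh-Hans"):
    print(example, describe_tag(example))
```

Under these assumptions, zh-TW is reported as a language-country combination and zh-Hans as a language-script combination, while a bare zh carries no qualifier at all.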
An interesting development arose in late 1999, however: the following
language tags for use in RFC 3066 applications were registered for
various Chinese languages:

  zh-gan     Kan or Gan
  zh-guoyu   Mandarin or Standard Chinese
  zh-hakka   Hakka
  zh-min     Min, Fuzhou, Hokkien, Amoy or Taiwanese
  zh-wuu     Shanghaiese or Wu
  zh-xiang   Xiang or Hunanese
  zh-yue     Cantonese

That established two interesting precedents, IMO:

1. that "zh" had the generic-Chinese meaning

2. that RFC 3066 language tags could take the form X-Y where X is an
   ISO 639 ID denoting a somewhat-generic language variety and Y is a
   qualifier identifying a specific language variety encompassed by X.

A new part to the ISO 639 series of standards is in preparation (ISO
639-3, about to be circulated for DIS ballot) that provides a much
larger set of IDs with far more complete coverage than is currently
available. This will provide three-letter IDs for each of the Chinese
languages. Once published, it is expected that it will be incorporated
into a revision of RFC 3066. At that point, there will be identifiers
available for each individual Chinese language; e.g. "gan". It's
possible the revision to RFC 3066 that incorporates ISO 639-3 would
give users the option to use either "gan" or "zh-gan".

Peter Constable
Received on Wednesday, 15 December 2004 15:13:50 UTC