Re: Language Identifier List up for comments

From: Tex Texin <tex@xencraft.com>
Date: Thu, 16 Dec 2004 10:19:32 -0800
Message-ID: <41C1D1B4.7E5A6745@xencraft.com>
To: Richard Ishida <ishida@w3.org>
CC: www-international@w3.org, 'IETF Languages' <ietf-languages@iana.org>

Richard,

Well, as you know, the choice between zh-CN and zh-TW is usually driven by the
script, and I have no issue with recommending zh-guoyu, other than that it also
doesn't indicate script and we don't (yet) have Hant and Hans flavors of it.

As for what we are doing with this page, speaking for myself, I am just
evaluating if there is a pattern that we can describe as a recommendation to
non-linguists for choosing a tag to label their content or to query for
content.
I was looking to identify when to use or not use regional or more generally
secondary subtags.
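
To make the "query for content" side concrete, here is a minimal sketch of
prefix-based fallback: the requested tag is shortened one subtag at a time
until something available matches. The function names fallback_chain and
best_match are hypothetical, chosen for illustration, and the approach is an
assumption about how a consumer might match tags, not something either mail
list has agreed on.

```python
def fallback_chain(tag):
    """Return the tag followed by its progressively shorter prefixes."""
    subtags = tag.split("-")
    return ["-".join(subtags[:n]) for n in range(len(subtags), 0, -1)]


def best_match(requested, available):
    """Pick the most specific available tag that prefixes the request.

    Comparison is case-insensitive; returns None if nothing matches.
    """
    by_lower = {t.lower(): t for t in available}
    for candidate in fallback_chain(requested.lower()):
        if candidate in by_lower:
            return by_lower[candidate]
    return None


if __name__ == "__main__":
    print(fallback_chain("zh-Hans-CN"))
    print(best_match("zh-CN", ["en", "zh"]))
```

Under this kind of matching, content labeled with the bare language tag is
still found by a more specific query, which is one argument for not adding a
regional subtag unless it carries real information.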

What seems clear to me is that it is difficult to know when to use the regional
tags since the criteria for distinguishing languages are not generally available
and there is often disagreement among experts or at least the well-informed.
And the fact that there are so many choices, such as you cite for Chinese,
means the average Joe Content-provider is not going to make an appropriate
selection.

(At least with script tags, the choices are clear.)
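
As an illustration of how script and region subtags can be told apart, here is
a rough sketch that classifies subtags by shape alone: four letters is taken
as a script (Hans, Hant), two letters or three digits as a region (CN, TW).
The function name split_tag is hypothetical, and the shape rules are an
assumption borrowed from the conventions for script and country codes; a
registered tag such as zh-guoyu simply lands in "other".

```python
def split_tag(tag):
    """Split a language tag into rough parts using subtag shape only."""
    subtags = tag.split("-")
    parts = {
        "language": subtags[0].lower(),
        "script": None,
        "region": None,
        "other": [],
    }
    for sub in subtags[1:]:
        # A script is only accepted directly after the language subtag.
        nothing_after_language = (
            parts["script"] is None
            and parts["region"] is None
            and not parts["other"]
        )
        looks_like_script = len(sub) == 4 and sub.isalpha()
        looks_like_region = (len(sub) == 2 and sub.isalpha()) or (
            len(sub) == 3 and sub.isdigit()
        )
        if looks_like_script and nothing_after_language:
            parts["script"] = sub.title()
        elif looks_like_region and parts["region"] is None and not parts["other"]:
            parts["region"] = sub.upper()
        else:
            parts["other"].append(sub.lower())
    return parts


if __name__ == "__main__":
    for example in ("zh-CN", "zh-hans-CN", "zh-guoyu"):
        print(example, split_tag(example))
```

Note that the sketch only says what kind of subtag is present; it says nothing
about whether a region subtag was the right choice for the content.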

As for whether the list is best or current practices, it is neither. It is just
identifying the choices as vetted by these two mail lists. The meaning of many
of the tags seems ill-defined, as one might expect given that languages do not
respect country boundaries, and evolve in communities that are not regionally
based, and even less so with modern communications facilities.

I doubt there are many people who can answer questions such as "what does xx-YY
mean?", other than with the nebulous reply "language xx as spoken/written in
country YY". Maybe, if someone published a list of spelling choices, words and
phrases, etc. (and other criteria used by linguists to differentiate these
variations) used in one region versus another, then we might have criteria by
which we could look at content and identify its appropriate language tag.

I will change the wording to not say "require a language subtag and country
subtag". I agree "require" is too strong.

tex

Richard Ishida wrote:
> 
> > Since there are only two tags for CN, zh-CN and zh-hans-CN,
> > would those who argue for not overdifferentiating tags,
> > recommend just the simpler zh-CN?
> > Similarly for TW, just zh-TW?
> 
> What does zh-CN mean?
> 
> It is most commonly used as far as I'm aware to indicate text written in the
> Simplified Chinese script.  For identification of the script I think we
> should recommend zh-Hans first these days - although we need to add caveats
> about the fact that some applications won't recognise it (eg. for automatic
> application of fonts in Unicode encoded Web pages on some browsers (see
> http://www.w3.org/International/tests/results/lang-and-cjk-font). There are
> not a huge number of applications, as far as I'm aware.)
> 
> Use of zh-CN doesn't seem to make sense for identifying spoken Chinese,
> since there are many dialects in China.  I think one should recommend
> zh-guoyu, zh-yue, etc. for this purpose.
> 
> Note also that Mandarin, Cantonese, Hakka, etc are spoken in many parts of
> the world.  My expectation is that the use of CN would only be appropriate
> if one wanted to explicitly make the point that one was referring to the
> language as spoken in Mainland China - ie. that there is some particular
> characteristic of the instance of text or audio recording that was
> idiosyncratic to that particular area as a whole.
> 
> And now what does zh-TW mean?  Well, usually text written in Traditional
> Chinese script, although we could repeat much of what I wrote above
> about zh-CN for this too.  zh-TW taken literally means the Chinese spoken in
> Taiwan - which happens to be Mandarin.  So unless you have particular
> distinguishing features in mind, perhaps, again you should just use
> zh-guoyu.
> 
> Then there's the question: what are we doing with this page?  Describing
> current usage or recommending best practices?  If the latter, perhaps zh-CN
> and zh-TW should only appear on the page if clearly marked as edge cases.
> 
> Btw, what does de-CH represent in the table?  Swiss German is different from
> de-DE, and rarely written, and then has little consistency to its
> orthography.  There are also many local variants to Swiss German across
> Switzerland, which would seem to invite a large number of additions to this
> table.  But presumably de-CH refers to the way de-DE German is written in
> Switzerland or spoken by newsreaders there (and there are a small number of
> significant differences here from de-DE.)?  If so, we ought to clarify that
> in the table.
> 
> I think this kind of process could be applied to many other parts of the
> second table, which worries me.  I can't help thinking that it might be
> better to talk through some examples of when to use en and when to use en-GB
> or en-US, talk through the choices for particular problem areas like Chinese
> and Swiss German, and so on, rather than to just list these combinations,
> most of which you could determine pretty easily anyway if you gave what you
> were doing a small amount of thought and had access to a list of country
> codes.
> 
> What might be more useful is to say, here is the simplest form to identify
> this language (eg. 'en'), and in the next column are a bunch of potential
> country or other codes you may want to consider using in conjunction with
> this.  Rather than, "This table lists the languages" and " require a
> language subtag and country subtag".
> 
> RI
> 

-- 
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com
                         
XenCraft		            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------
Received on Thursday, 16 December 2004 18:19:36 GMT
