Re: Language Identifier List Criteria

From: Tex Texin <tex@xencraft.com>
Date: Mon, 20 Dec 2004 19:49:19 -0800
Message-ID: <41C79D3F.9874122C@xencraft.com>
To: Mark Davis <mark.davis@jtcsv.com>
CC: Georg Schweizer <gschweizer@gmx.at>, www-international@w3.org, ietf-languages@alvestrand.no, John Cowan <jcowan@reutershealth.com>

Mark,

For quotations, I prefer Lewis Carroll's:

'When I use a word,' Humpty Dumpty said in rather a scornful tone, 'it means
just what I choose it to mean -- neither more nor less.' 

;-)


The distinctions you are drawing strike me as a bit like the difference between
precision and accuracy: you can have one without the other. But that is an
aside.

Now if the meaning were not ambiguous, I could take something written by a
German, with blond hair, living and writing in Liechtenstein, and know that I
should label it de-LI. But I am being told it is wrong to do that; I should
use de-CH.

And if I have an article written in Japanese, by a redhead (but dyed), in
Japan, and I label it ja-JP, I am also wrong; it should be just ja.

And when I ask how to determine that what I am told is right, I am not getting
an answer I can apply, just some virtual hand waving.

So, whereas some might consider these edge cases, I consider the lack of
criteria tantamount to ambiguity.

However, just to be clear, in case my mail is being misinterpreted as an attack
on 3066 or its successor(s), that is not at all my intent. RFC 3066 is a
valuable tool, needed for consistency and interoperability. We can agree on
3066 as a naming convention. But there is also a need for guidelines on the
semantics of the language tags. There might even be a need for several
different guidelines: one for linguists, one for software message catalogs,
one for labeling web pages, one for use in web services, etc. We have noted in
the thread that the guidance might need to vary by application. If so, so be
it. I am just looking for a start on identifying the criteria for some
contexts.
If my comments were taken as an objection to 3066bis, they were not intended,
nor should they be taken, to be so.

Meanwhile, I await more suggestions for criteria.

On another related topic, I am considering organizing the next version of the
table differently.
It strikes me that for my needs, and my intended audience, it is less
interesting to list languages and note the regions where they are spoken than
to list each region and note the languages used there.
If I do that, I do not have to deal with meaningless identifiers and map them
to the correct ones to use.

So I might have:

region  languages
JP      ja
LI      de-CH (maybe others, I don't know.)
CH      de-CH, it, fr-FR, rm
US      en-US, es-US
CA      fr-CA, en-CA, iu

With this approach, I can suggest something like the most popular choices
without ruling out the existence of other languages being used. The absence of
de-LI makes a statement about de-LI vs de-CH without my being as explicit
about criteria, other than perhaps a combination of popular choices by major
software vendors, official languages, and claims of encyclopedias and the like
as to what is spoken where.

This approach is more helpful to folks like me who are looking to answer what
they need to provide. If someone wants to know how many variants of German they
need, they can scan the table for all listings of de and de-*, and even scan
just the regions they support to determine their perhaps more exact needs.
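
As a rough sketch of that kind of scan, here is how it might look in Python.
The table contents are only the sample rows above, and the names
REGION_LANGUAGES and variants_of are made up for illustration; nothing here is
taken from RFC 3066 or from the actual table.

```python
# Hypothetical region -> language-tag table, copied from the sample rows above.
REGION_LANGUAGES = {
    "JP": ["ja"],
    "LI": ["de-CH"],
    "CH": ["de-CH", "it", "fr-FR", "rm"],
    "US": ["en-US", "es-US"],
    "CA": ["fr-CA", "en-CA", "iu"],
}


def variants_of(language, regions=None):
    """Return the sorted tags that are exactly `language` or start with `language-`.

    If `regions` is given, only those regions of the table are scanned;
    regions missing from the table are silently skipped.
    """
    if regions is None:
        selected = REGION_LANGUAGES
    else:
        selected = {r: REGION_LANGUAGES[r] for r in regions if r in REGION_LANGUAGES}

    base = language.lower()
    prefix = base + "-"
    found = set()
    for tags in selected.values():
        for tag in tags:
            lowered = tag.lower()  # language tags are compared case-insensitively
            if lowered == base or lowered.startswith(prefix):
                found.add(tag)
    return sorted(found)


print(variants_of("de"))                        # ['de-CH']
print(variants_of("fr", regions=["CA", "CH"]))  # ['fr-CA', 'fr-FR']
```

With the sample rows, the scan for German finds only de-CH, which is the point
of organizing by region: de-LI never needs to be listed or ruled out.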

It's also easier for me to accept edits of the list from people suggesting that
language xx-YY is used in region ZZ, without a lot of vetting effort.

Would that work for people?

tex

Mark Davis wrote:
> 
> > However, RFC 3066's approach is generative. So de-AT is created by combining
> > codes from each of ISO 639 and ISO 3166, and neither defines what this means.
> > In fact RFC 3066 only defines the production and not which of the produced
> > values are meaningful or what they mean except in the most general terms.
> 
> You keep expressing this in a counterproductive way. The language tag
> 'de-AT' is reasonably defined: German as used in Austria.
> 
> Suppose I have a protocol ID that distinguishes categories of people by
> combining hair-color with nationality. Then "Samoan, blond" is perfectly
> well defined. The fact that there are no existing examples does *NOT* mean
> that it is "ambiguous", "not meaningful", or "not well-defined". And let's
> suppose that all Danes were blond. Then "Dane, blond" would still be well
> defined. The fact that it happens to have the same current denotation as
> "Dane" does *NOT* mean that it is "ambiguous",  "not meaningful", or "not
> well-defined".
> 
> To paraphrase Inigo Montoya, "I do not think those words mean what you think
> they mean!"
> 
> Now, there are in fact edge cases; when is someone dishwater blond vs pale
> brunette; what do you do with dual citizenship, etc. But it doesn't mean
> that the protocol is senseless. And once you have a criterion of usage, you
> can establish when two IDs have the same denotation or not, by doing the
> research to see whether there are in fact non-blond Danes.
> 
> What you are really looking for is which language tags have the same
> denotation, under some criterion of usage. And that criterion might be "does
> someone need to provide different localizations (for non-speech enabled
> applications)". That is *very* different from saying that the language tags
> are "ambiguous",  "not meaningful", or "not well-defined".
> 
> Mark
> 
> ----- Original Message -----
> From: "Tex Texin" <tex@xencraft.com>
> To: "Georg Schweizer" <gschweizer@gmx.at>
> Cc: <www-international@w3.org>; <ietf-languages@alvestrand.no>
> Sent: Monday, December 20, 2004 13:26
> Subject: Re: Language Identifier List Criteria
> 
> > Well, I will leave it to others to debate the characterization of the
> > standards as political, if they choose to.
> > However, RFC 3066's approach is generative. So de-AT is created by combining
> > codes from each of ISO 639 and ISO 3166, and neither defines what this means.
> > In fact RFC 3066 only defines the production and not which of the produced
> > values are meaningful or what they mean except in the most general terms.
> > That's why we are having this discussion.
> >
> > Under RFC 3066 it is possible to create combinations of language and region
> > that have no useful value.
> > So this need not be determined politically.
> >
> > For myself, I am looking for guidance for software and web producers. Which
> > labels to use when tagging content? When is there enough of a difference
> > that it bears paying for a translation?
> >
> > It is not clear to me it should be purely linguistic however.
> > Politics is perhaps one element of the criteria.
> > tex
> >
> >
> > Georg Schweizer wrote:
> > >
> > > > Some languages are spoken in many countries, and the language is not
> > > > distinctive in each country. I have started to accept suggestions as
> > > > to which language-region codes do not represent a distinct language
> > > > variation, and therefore are not recommended as tags, without good
> > > > reason.
> > > http://www.i18nguy.com/unicode/language-identifiers.html
> > >
> > > The criteria should be political rather than linguistic ones, as both
> > > the ISO 639 language tags and the ISO 3166 country codes are based on
> > > political agreement. Therefore I would not speak of "distinct language
> > > variations", but of distinct *official standards* (or at least distinct
> > > conventions). Variations can be found everywhere (even within one
> > > political region), whereas the same conventions can be followed by
> > > several countries.
> >
> > --
> > -------------------------------------------------------------
> > Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
> > Xen Master                          http://www.i18nGuy.com
> >
> > XenCraft             http://www.XenCraft.com
> > Making e-Business Work Around the World
> > -------------------------------------------------------------

-- 
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com
                         
XenCraft		            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------
Received on Tuesday, 21 December 2004 03:49:33 GMT
