Re: Updated article: Two-letter or three-letter language codes

From: Mark Davis <mark.davis@icu-project.org>
Date: Sat, 23 Sep 2006 09:31:58 -0700
Message-ID: <30b660a20609230931y3c7ed6bah38549885e72b1aa8@mail.gmail.com>
To: "Stephen Deach" <sdeach@adobe.com>
Cc: "Martin Duerst" <duerst@it.aoyama.ac.jp>, "Misha Wolf" <Misha.Wolf@reuters.com>, "Richard Ishida" <ishida@w3.org>, www-international@w3.org, ltru@ietf.org
I can appreciate the goal. In the case of language tags, we've done
some analysis here at Google, and at least in a (large) sample of web
pages and xml documents, the three-letter codes don't account for many
of the problem cases.

Total-Valid	99.62%
Total-WellFormed	99.71%

Here are some examples of what we do find.

Ill-formed:
(The second one has a space at the end. This also excludes x-... tags
where the ... is a subtag longer than 8 characters; those have a pretty
high frequency.)
Rank	Frequency	tag
102	0.015999%	en-us.
122	0.010068%	en-us
219	0.001668%	es-es-ts
302	0.000638%	q=0.5
304	0.000634%	undefined
339	0.000429%	español
391	0.000325%	Indonesian
458	0.000185%	utf-8
464	0.000178%	pt-br
467	0.000173%	türkçe
481	0.000158%	portugues
503	0.000138%	de
518	0.000126%	es-ES-TS
529	0.000120%	Vietnamese
547	0.000107%	sr-sp-latn
549	0.000107%	e
555	0.000102%	Language T20029 2005-05-18
...

Well-Formed but Invalid:
Rank	Frequency	tag
88	0.024632%	en-securid
133	0.007796%	English
136	0.007353%	xl
160	0.004739%	Chinese
176	0.003235%	zs
182	0.003062%	us
183	0.003054%	chinese
184	0.002891%	eses
188	0.002497%	in
189	0.002461%	pdf
210	0.001827%	en-sp
213	0.001771%	es-sp
248	0.001150%	zh-chs
254	0.001088%	French
262	0.001019%	po
276	0.000873%	sr-SP
279	0.000865%	no-bok
284	0.000803%	Arabic
293	0.000733%	sr_SP
299	0.000667%	en-en
303	0.000637%	ua
318	0.000570%	jp
...
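
The split between the two lists above can be sketched in code. This is only an illustration under stated assumptions, not the analysis that produced the numbers: the pattern is a simplified reading of the RFC 4646 syntax with grandfathered tags left out, the KNOWN_* sets are a tiny hand-picked subset standing in for the IANA Language Subtag Registry, and the classify() helper is a hypothetical name. It does no normalization, so its results can differ from the lists above for tags such as sr_SP.

```python
import re

A = r"[A-Za-z0-9]"  # "alphanum" in the ABNF

# Simplified shape of a language tag, loosely following the RFC 4646 ABNF.
# Grandfathered tags are deliberately left out of this sketch.
WELL_FORMED = re.compile(
    r"(?P<language>[A-Za-z]{2,3}(?:-[A-Za-z]{3}){0,3}|[A-Za-z]{4,8})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
    r"(?P<variants>(?:-(?:" + A + r"{5,8}|[0-9]" + A + r"{3}))*)"
    r"(?:-[A-WY-Za-wy-z0-9](?:-" + A + r"{2,8})+)*"   # extensions
    r"(?:-[Xx](?:-" + A + r"{1,8})+)?"                # private-use suffix
    r"|[Xx](?:-" + A + r"{1,8})+"                     # private-use-only tag
)

# Assumption: a small illustrative subset, NOT the real subtag registry.
KNOWN_LANGUAGES = {"en", "es", "de", "pt", "zh", "sr", "ja", "fr"}
KNOWN_SCRIPTS = {"latn", "cyrl", "hans", "hant"}
KNOWN_REGIONS = {"us", "es", "br", "cn", "de", "fr", "jp"}
KNOWN_VARIANTS = {"1901", "1996"}


def classify(tag):
    """Return 'ill-formed', 'well-formed but invalid', or 'valid'."""
    m = WELL_FORMED.fullmatch(tag)
    if m is None:
        return "ill-formed"

    language = m.group("language")
    if language is None:
        # Private-use-only tag (x-...): nothing to look up.
        return "valid"

    # A language with extlang subtags (e.g. "zh-chs") is not in the set,
    # so it is reported as invalid here.
    if language.lower() not in KNOWN_LANGUAGES:
        return "well-formed but invalid"

    script = m.group("script")
    if script and script.lower() not in KNOWN_SCRIPTS:
        return "well-formed but invalid"

    region = m.group("region")
    if region and region.lower() not in KNOWN_REGIONS:
        return "well-formed but invalid"

    # The variants group looks like "-securid" or "-1996-..."; drop the
    # empty piece before the first hyphen.
    for variant in (m.group("variants") or "").split("-")[1:]:
        if variant.lower() not in KNOWN_VARIANTS:
            return "well-formed but invalid"

    return "valid"


if __name__ == "__main__":
    for tag in ["en-us.", "es-es-ts", "English", "en-securid", "es-sp", "es-ES"]:
        print(repr(tag), "->", classify(tag))
```

A real checker would load the full registry instead of the hard-coded sets, and would also need the grandfathered tags.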

On 9/23/06, Stephen Deach <sdeach@adobe.com> wrote:
>
> I just wanted to make sure this "shortest code" issue was considered
> carefully.
>    A lot of people I've talked to about internationalization issues over
> the years simply had "assumed" that the 3-letter ISO codes superseded the
> 2-letter ones, or chose to use all 3-letter codes rather than a mix of 2 &
> 3 because it was easier to make it a fixed-length field.
>
> I understand your goal is to eventually make this simpler, by eliminating
> multiple formats for each subtoken and moving to a single registry/list. As
> a general process I always try to accept ill-formed input, but emit
> corrected output (since you pretty much have to grandfather all past formats).
>
>
> At 2006.09.23-11:29(+0900), Martin Duerst wrote:
> >Exactly. Codes should be converted at the boundaries to systems that
> >can't handle anything other than three-letter codes. It has to be done
> >one way, so it can as well be done both ways.
> >
> >Regards,   Martin.
> >
> >At 00:07 06/09/23, Misha Wolf wrote:
> > >
> > >That would be seriously broken.  It would encourage
> > >people to violate BCP 47.
> > >
> > >Misha
> > >
> > >
> > >-----Original Message-----
> > >From: Stephen Deach [mailto:sdeach@adobe.com]
> > >Sent: 22 September 2006 16:05
> > >To: Misha Wolf; Richard Ishida
> > >Cc: www-international@w3.org
> > >Subject: RE: Updated article: Two-letter or three-letter language codes
> > >
> > >I would strongly recommend that all processing applications support both
> > >2 & 3 letter ISO codes. It was the only way to get some countries, and some
> > >applications (especially in business databases) simply always use the 3
> > >letter codes.
> > >
> > >
> >
> >
> >#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
> >#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
>
>
> ---Steve Deach
>     sdeach@adobe.com
>
>
>
Received on Saturday, 23 September 2006 16:32:08 GMT
