- From: Mark Davis <mark.davis@icu-project.org>
- Date: Sat, 23 Sep 2006 09:31:58 -0700
- To: "Stephen Deach" <sdeach@adobe.com>
- Cc: "Martin Duerst" <duerst@it.aoyama.ac.jp>, "Misha Wolf" <Misha.Wolf@reuters.com>, "Richard Ishida" <ishida@w3.org>, www-international@w3.org, ltru@ietf.org
I can appreciate the goal. In the case of language tags, we've done some analysis here at Google, and at least in a (large) sample of web pages and XML documents, the three-letter codes don't account for many of the problem cases.

Total-Valid       99.62%
Total-WellFormed  99.71%

Here are some examples of what we do find.

Ill-formed: (the second one has a space at the end. This also excludes x-.... where the ... is a subtag longer than 8 -- that has a pretty high frequency)

Rank  Frequency  tag
102   0.015999%  en-us.
122   0.010068%  en-us
219   0.001668%  es-es-ts
302   0.000638%  q=0.5
304   0.000634%  undefined
339   0.000429%  español
391   0.000325%  Indonesian
458   0.000185%  utf-8
464   0.000178%  pt-br
467   0.000173%  türkçe
481   0.000158%  portugues
503   0.000138%  de
518   0.000126%  es-ES-TS
529   0.000120%  Vietnamese
547   0.000107%  sr-sp-latn
549   0.000107%  e
555   0.000102%  Language T20029 2005-05-18
...

Well-Formed but Invalid

Rank  Frequency  tag
88    0.024632%  en-securid
133   0.007796%  English
136   0.007353%  xl
160   0.004739%  Chinese
176   0.003235%  zs
182   0.003062%  us
183   0.003054%  chinese
184   0.002891%  eses
188   0.002497%  in
189   0.002461%  pdf
210   0.001827%  en-sp
213   0.001771%  es-sp
248   0.001150%  zh-chs
254   0.001088%  French
262   0.001019%  po
276   0.000873%  sr-SP
279   0.000865%  no-bok
284   0.000803%  Arabic
293   0.000733%  sr_SP
299   0.000667%  en-en
303   0.000637%  ua
318   0.000570%  jp
...

On 9/23/06, Stephen Deach <sdeach@adobe.com> wrote:
>
> I just wanted to make sure this "shortest code" issue was considered
> carefully.
> A lot of people I've talked to about internationalization issues over
> the years simply had "assumed" that the 3-letter ISO codes superseded the
> 2-letter ones, or chose to use all 3-letter codes rather than a mix of 2 &
> 3 because it was easier to make it a fixed-length field.
>
> I understand your goal is to eventually make this simpler, by eliminating
> multiple formats for each subtoken and moving to a single registry/list.
> As a general process I always try to accept ill-formed input, but emit
> corrected output (since you pretty much have to grandfather all past formats).
>
>
> At 2006.09.23-11:29(+0900), Martin Duerst wrote:
> >Exactly. Codes should be converted at the boundaries to systems that
> >can't handle anything else than three-letter codes. It has to be done
> >one way, so it can as well be done both ways.
> >
> >Regards, Martin.
> >
> >At 00:07 06/09/23, Misha Wolf wrote:
> > >
> > >That would be seriously broken. It would encourage
> > >people to violate BCP 47.
> > >
> > >Misha
> > >
> > >
> > >-----Original Message-----
> > >From: Stephen Deach [mailto:sdeach@adobe.com]
> > >Sent: 22 September 2006 16:05
> > >To: Misha Wolf; Richard Ishida
> > >Cc: www-international@w3.org
> > >Subject: RE: Updated article: Two-letter or three-letter language codes
> > >
> > >I would strongly recommend that all processing applications support both
> > >2 & 3 letter ISO codes. It was the only way to get some countries and some
> > >applications (especially in business databases) simply always use the 3
> > >letter codes.
> > >
> > >
> > >This email was sent to you by Reuters, the global news and information company.
> > >To find out more about Reuters visit www.about.reuters.com
> > >
> > >Any views expressed in this message are those of the individual sender,
> > >except where the sender specifically states them to be the views of Reuters Ltd.
> >
> >
> >#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
> >#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
>
>
> ---Steve Deach
> sdeach@adobe.com
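
To make the "ill-formed" versus "well-formed but invalid" split above concrete, and to illustrate the "accept ill-formed input, emit corrected output" approach from the quoted thread, here is a minimal Python sketch. It is not the analysis code used for the statistics in this message. It assumes a simplified form of the RFC 4646 syntax (grandfathered tags are ignored), and the names is_well_formed, normalize and _THREE_TO_TWO are hypothetical, as is the four-entry three-letter to two-letter table. Checking validity would additionally need the IANA Language Subtag Registry, which is not modelled here.

```python
import re
from typing import Optional

# Simplified RFC 4646 "langtag" / "privateuse" syntax. Assumption: grandfathered
# tags are deliberately not handled in this sketch.
_LANGTAG = re.compile(
    r"""
    (?:
        (?:[a-z]{2,3}(?:-[a-z]{3}){0,3} | [a-z]{4} | [a-z]{5,8})  # language (+ extlang)
        (?:-[a-z]{4})?                                            # script
        (?:-(?:[a-z]{2}|[0-9]{3}))?                               # region
        (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*                  # variants
        (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*                      # extensions
        (?:-x(?:-[a-z0-9]{1,8})+)?                                # private use
      |
        x(?:-[a-z0-9]{1,8})+                                      # private use only
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Hypothetical, deliberately tiny three-letter -> two-letter table (illustration only).
_THREE_TO_TWO = {"eng": "en", "fra": "fr", "deu": "de", "spa": "es"}


def is_well_formed(tag: str) -> bool:
    """Syntax check only; says nothing about whether the subtags are registered."""
    return _LANGTAG.fullmatch(tag) is not None


def normalize(raw: str) -> Optional[str]:
    """Accept sloppy input; return a cleaned-up tag, or None if still ill-formed."""
    # Repair a few common slips: surrounding spaces, trailing dots, underscores.
    tag = raw.strip().rstrip(".").replace("_", "-")
    if not is_well_formed(tag):
        return None
    subtags = tag.lower().split("-")
    if subtags[0] == "x":                      # private-use-only tag: just lowercase
        return "-".join(subtags)
    out = [_THREE_TO_TWO.get(subtags[0], subtags[0])]
    for i, sub in enumerate(subtags[1:], start=1):
        if len(sub) == 1:                      # singleton: keep the rest lowercase
            out.extend(subtags[i:])
            break
        if len(sub) == 4 and sub.isalpha():    # script subtag -> Title case
            sub = sub.title()
        elif len(sub) == 2:                    # region subtag -> UPPER case
            sub = sub.upper()
        out.append(sub)
    return "-".join(out)


if __name__ == "__main__":
    for sample in ["en-us.", "eng-us", "sr-sp-latn", "en-securid", "q=0.5"]:
        print(repr(sample), is_well_formed(sample), normalize(sample))
```

Under these assumptions, "en-us." and "eng-us" both come out as "en-US", while "sr-sp-latn" and "q=0.5" are still rejected because the repairs do not touch subtag order or stray characters. "en-securid" passes the syntax check, which is why a tag like it can only be caught by a registry lookup.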
Received on Saturday, 23 September 2006 16:32:08 UTC