Re: Updated article: Two-letter or three-letter language codes from Mark Davis on 2006-09-23 (www-international@w3.org from July to September 2006)

From: Mark Davis <mark.davis@icu-project.org>
Date: Sat, 23 Sep 2006 09:31:58 -0700
To: "Stephen Deach" <sdeach@adobe.com>
Cc: "Martin Duerst" <duerst@it.aoyama.ac.jp>, "Misha Wolf" <Misha.Wolf@reuters.com>, "Richard Ishida" <ishida@w3.org>, www-international@w3.org, ltru@ietf.org
Message-ID: <30b660a20609230931y3c7ed6bah38549885e72b1aa8@mail.gmail.com>

I can appreciate the goal. In the case of language tags, we've done
some analysis here at Google, and at least in a (large) sample of web
pages and xml documents, the three-letter codes don't account for many
of the problem cases.

Total-Valid 99.62%
Total-WellFormed 99.71%

Here are some examples of what we do find.

Ill-formed:
(The second one has a space at the end. This list also excludes x-....
where the ... is a subtag longer than 8 characters -- that has a pretty
high frequency.)
Rank Frequency tag
102 0.015999% en-us.
122 0.010068% en-us
219 0.001668% es-es-ts
302 0.000638% q=0.5
304 0.000634% undefined
339 0.000429% español
391 0.000325% Indonesian
458 0.000185% utf-8
464 0.000178% pt-br
467 0.000173% türkçe
481 0.000158% portugues
503 0.000138% de
518 0.000126% es-ES-TS
529 0.000120% Vietnamese
547 0.000107% sr-sp-latn
549 0.000107% e
555 0.000102% Language T20029 2005-05-18
...

Well-Formed but Invalid
88 0.024632% en-securid
133 0.007796% English
136 0.007353% xl
160 0.004739% Chinese
176 0.003235% zs
182 0.003062% us
183 0.003054% chinese
184 0.002891% eses
188 0.002497% in
189 0.002461% pdf
210 0.001827% en-sp
213 0.001771% es-sp
248 0.001150% zh-chs
254 0.001088% French
262 0.001019% po
276 0.000873% sr-SP
279 0.000865% no-bok
284 0.000803% Arabic
293 0.000733% sr_SP
299 0.000667% en-en
303 0.000637% ua
318 0.000570% jp
...
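
To make the two categories above concrete, here is a rough sketch of the distinction in Python. It is not the tooling behind the figures above: the pattern covers only a simplified subset of the RFC 4646 syntax (no extensions, private use, or grandfathered tags), and the SAMPLE_* sets are tiny hypothetical stand-ins for the real subtag registry. It may well classify some edge cases (for example, tags that use "_" as a separator) differently from the lists above.

```python
import re

# Simplified langtag syntax: language (with optional extlang subtags),
# optional script, optional region, then any number of variants.
# Extensions, private-use and grandfathered tags are deliberately omitted.
LANGTAG = re.compile(
    r"(?P<language>[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4,8})"
    r"(?:-(?P<script>[a-z]{4}))?"
    r"(?:-(?P<region>[a-z]{2}|[0-9]{3}))?"
    r"(?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)",
    re.IGNORECASE | re.ASCII,
)

# Hypothetical, illustrative subsets only; a real check would load the
# full language subtag registry instead.
SAMPLE_LANGUAGES = {"en", "es", "zh", "sr", "pt", "de", "fr"}
SAMPLE_SCRIPTS = {"latn", "cyrl"}
SAMPLE_REGIONS = {"US", "ES", "BR", "CN", "RS"}


def classify(tag: str) -> str:
    """Return 'ill-formed', 'well-formed but invalid', or 'valid'."""
    m = LANGTAG.fullmatch(tag)
    if m is None:
        return "ill-formed"  # syntax error, e.g. trailing space or bad length
    language = m.group("language").lower()
    script = m.group("script")
    region = m.group("region")
    if language not in SAMPLE_LANGUAGES:
        return "well-formed but invalid"
    if script is not None and script.lower() not in SAMPLE_SCRIPTS:
        return "well-formed but invalid"
    if region is not None and region.upper() not in SAMPLE_REGIONS:
        return "well-formed but invalid"
    if m.group("variants"):
        # The sample registry above contains no variants at all.
        return "well-formed but invalid"
    return "valid"
```

With these assumptions, classify("en-us ") and classify("es-es-ts") come out ill-formed, while classify("en-securid") and classify("English") are syntactically fine but fail the lookup.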

On 9/23/06, Stephen Deach <sdeach@adobe.com> wrote:
>
> I just wanted to make sure this "shortest code" issue was considered
> carefully.
>    A lot of people I've talked to about internationalization issues over
> the years simply had "assumed" that the 3-letter ISO codes superseded the
> 2-letter ones, or chose to use all 3-letter codes rather than a mix of 2 &
> 3 because it was easier to make it a fixed-length field.
>
> I understand your goal is to eventually make this simpler, by eliminating
> multiple formats for each subtoken and moving to a single registry/list. As
> a general process I always try to accept ill-formed input, but emit
> corrected output (since you pretty much have to grandfather all past formats).
>
>
> At 2006.09.23-11:29(+0900), Martin Duerst wrote:
> >Exactly. Codes should be converted at the boundaries to systems that
> >can't handle anything other than three-letter codes. It has to be done
> >one way, so it may as well be done both ways.
> >
> >Regards,   Martin.
> >
> >At 00:07 06/09/23, Misha Wolf wrote:
> > >
> > >That would be seriously broken.  It would encourage
> > >people to violate BCP 47.
> > >
> > >Misha
> > >
> > >
> > >-----Original Message-----
> > >From: Stephen Deach [mailto:sdeach@adobe.com]
> > >Sent: 22 September 2006 16:05
> > >To: Misha Wolf; Richard Ishida
> > >Cc: www-international@w3.org
> > >Subject: RE: Updated article: Two-letter or three-letter language codes
> > >
> > >I would strongly recommend that all processing applications support both
> > >2 & 3 letter ISO codes. It was the only way to get some countries and some
> > >applications (especially in business databases) simply always use the 3
> > >letter codes.
> > >
> > >
> >
> >
> >#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
> >#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
>
>
> ---Steve Deach
>     sdeach@adobe.com
>
>
>
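
The "accept ill-formed input, but emit corrected output" approach quoted above, combined with converting at the boundary, might look roughly like the following Python sketch. The THREE_TO_TWO table is a small illustrative subset of ISO 639-2 codes that have a two-letter equivalent, and normalize() is a hypothetical helper name. The casing rules (upper-case region, title-case script) follow the usual recommendation, and private-use subtags are not handled specially.

```python
# Illustrative subset of ISO 639-2 codes (both the "T" and "B" forms) that
# have an ISO 639-1 two-letter equivalent. A real table would be much larger.
THREE_TO_TWO = {
    "eng": "en",
    "fra": "fr", "fre": "fr",
    "deu": "de", "ger": "de",
    "spa": "es",
    "zho": "zh", "chi": "zh",
    "jpn": "ja",
}


def normalize(tag: str) -> str:
    """Leniently read a language tag and emit a cleaned-up form."""
    # Tolerate stray whitespace and "_" used as a separator.
    subtags = tag.strip().replace("_", "-").split("-")
    language = subtags[0].lower()
    # Prefer the shortest code where one exists.
    language = THREE_TO_TWO.get(language, language)
    out = [language]
    for sub in subtags[1:]:
        if len(sub) == 2 and sub.isalpha():
            out.append(sub.upper())   # looks like a region
        elif len(sub) == 4 and sub.isalpha():
            out.append(sub.title())   # looks like a script
        else:
            out.append(sub.lower())
    return "-".join(out)
```

For example, normalize("eng-us ") gives "en-US" and normalize("sr_latn_rs") gives "sr-Latn-RS". The reverse mapping, for a system that only takes three-letter codes, would be the same table inverted at that boundary.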

Received on Saturday, 23 September 2006 16:32:08 UTC