Re: Updated article: Two-letter or three-letter language codes

Hello Mark,

Many thanks for this interesting data. In my mailer, the line endings
didn't work very well, so I reformatted it (at the same time, my
mailer messed up the non-ASCII stuff, sorry).
The only three-letter code I can see is pdf, for which we can blame
Steve's company :-).

On the other hand, in the list, there are a few items (such as en-us,
pt-br) that look perfectly fine. What was wrong with them?
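
One possible explanation, given the note in the quoted text that one
of the en-us entries ends in a space: characters that are invisible in
a listing are enough to make a tag ill-formed. Below is a minimal
sketch of a well-formedness check, assuming a simplified version of
the RFC 4646 langtag grammar (grandfathered tags are left out). The
names LANGTAG and is_well_formed are made up for illustration; this is
not the checker used for the numbers quoted below, and whether pt-br
was affected in the same way is only a guess.

```python
import re

# Simplified RFC 4646 grammar (an assumption for illustration):
# grandfathered/irregular tags are deliberately not handled.
LANGTAG = re.compile(
    r"""
    (?:
        (?:[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4}|[a-z]{5,8})  # language, optional extlang
        (?:-[a-z]{4})?                                        # script
        (?:-(?:[a-z]{2}|[0-9]{3}))?                           # region
        (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*              # variants
        (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*                  # extensions
        (?:-x(?:-[a-z0-9]{1,8})+)?                            # private use suffix
    |
        x(?:-[a-z0-9]{1,8})+                                  # private-use-only tag
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)


def is_well_formed(tag: str) -> bool:
    """Return True if the raw string matches the simplified grammar.

    The string is checked exactly as given: no trimming, no case folding
    beyond what the pattern allows, so stray spaces or dots make it fail.
    """
    return LANGTAG.fullmatch(tag) is not None


# repr() makes trailing whitespace visible in the output.
for raw in ["en-us", "en-us ", "en-us.", "pt-br", "es-es-ts", "utf-8", "English"]:
    print(f"{raw!r:12} well-formed: {is_well_formed(raw)}")
```

A checker like this only tests syntax. Something like "English" passes
because it is 5 to 8 letters long, even though it is not a registered
language subtag, which would put it in the "well-formed but invalid"
group.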

Regards,   Martin.

At 01:31 06/09/24, Mark Davis wrote:
>I can appreciate the goal. In the case of language tags, we've done
>some analysis here at Google, and at least in a (large) sample of web
>pages and xml documents, the three-letter codes don't account for many
>of the problem cases.

>Total-Valid            99.62%
>Total-WellFormed       99.71%
>
>Here are some examples of what we do find.
>
>Ill-formed:
>(The second one has a space at the end. This also excludes x-....
>where the ... is a subtag longer than 8 characters; that has a pretty
>high frequency.)

>Rank   Frequency       tag

>102    0.015999%       en-us.
>122    0.010068%       en-us
>219    0.001668%       es-es-ts
>302    0.000638%       q=0.5
>304    0.000634%       undefined
>339    0.000429%       espa$B~A%9(Bol
>391    0.000325%       Indonesian
>458    0.000185%       utf-8
>464    0.000178%       pt-br
>467    0.000173%       t$B~A%9(Brk$B~A%9(Be
>481    0.000158%       portugues
>503    0.000138%       de
>518    0.000126%       es-ES-TS
>529    0.000120%       Vietnamese
>547    0.000107%       sr-sp-latn
>549    0.000107%       e
>555    0.000102%       Language T20029 2005-05-18
>...
>
>Well-Formed but Invalid
>88     0.024632%       en-securid
>133    0.007796%       English
>136    0.007353%       xl
>160    0.004739%       Chinese
>176    0.003235%       zs
>182    0.003062%       us
>183    0.003054%       chinese
>184    0.002891%       eses
>188    0.002497%       in
>189    0.002461%       pdf
>210    0.001827%       en-sp
>213    0.001771%       es-sp
>248    0.001150%       zh-chs
>254    0.001088%       French
>262    0.001019%       po
>276    0.000873%       sr-SP
>279    0.000865%       no-bok
>284    0.000803%       Arabic
>293    0.000733%       sr_SP
>299    0.000667%       en-en
>303    0.000637%       ua
>318    0.000570%       jp




#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp     

Received on Sunday, 24 September 2006 06:03:57 UTC