Re: Updated article: Two-letter or three-letter language codes

The fact that "utf-8" and "pdf" (which I would call "encodings") show up in 
this list points out something interesting about the term "language" that 
should probably be emphasized in the paper/article.


At 2006.09.24-11:56(+0900), Martin Duerst wrote:
>Hello Mark,
>
>Many thanks for this interesting data. In my mailer, the line endings
>didn't work very well, so I reformatted it (at the same time, my
>mailer messed up the non-ASCII stuff, sorry).
>The only three-letter code I can see is pdf, for which we can blame
>Steve's company :-).
>
>On the other hand, in the list, there are a few items (such as en-us,
>pt-br) that look perfectly fine. What was wrong with them?
>
>Regards,   Martin.
>
>At 01:31 06/09/24, Mark Davis wrote:
> >I can appreciate the goal. In the case of language tags, we've done
> >some analysis here at Google, and at least in a (large) sample of web
> >pages and xml documents, the three-letter codes don't account for many
> >of the problem cases.
>
> >Total-Valid            99.62%
> >Total-WellFormed       99.71%
> >
> >Here are some examples of what we do find.
> > >Ill-formed:
> >(The second one has a space at the end. This list also excludes x-....
> >tags where the ... is a subtag longer than 8 characters; those occur
> >with a pretty high frequency.)
>
> >Rank   Frequency       tag
>
> >102    0.015999%       en-us.
> >122    0.010068%       en-us
> >219    0.001668%       es-es-ts
> >302    0.000638%       q=0.5
> >304    0.000634%       undefined
> >339    0.000429%       espa$B~A%9(Bol
> >391    0.000325%       Indonesian
> >458    0.000185%       utf-8
> >464    0.000178%       pt-br
> >467    0.000173%       t$B~A%9(Brk$B~A%9(Be
> >481    0.000158%       portugues
> >503    0.000138%       de
> >518    0.000126%       es-ES-TS
> >529    0.000120%       Vietnamese
> >547    0.000107%       sr-sp-latn
> >549    0.000107%       e
> >555    0.000102%       Language T20029 2005-05-18
> >...
> >
> >Well-Formed but Invalid
> >88     0.024632%       en-securid
> >133    0.007796%       English
> >136    0.007353%       xl
> >160    0.004739%       Chinese
> >176    0.003235%       zs
> >182    0.003062%       us
> >183    0.003054%       chinese
> >184    0.002891%       eses
> >188    0.002497%       in
> >189    0.002461%       pdf
> >210    0.001827%       en-sp
> >213    0.001771%       es-sp
> >248    0.001150%       zh-chs
> >254    0.001088%       French
> >262    0.001019%       po
> >276    0.000873%       sr-SP
> >279    0.000865%       no-bok
> >284    0.000803%       Arabic
> >293    0.000733%       sr_SP
> >299    0.000667%       en-en
> >303    0.000637%       ua
> >318    0.000570%       jp
>
>
>
>
>#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
>#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
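
To make the well-formed/valid distinction in the quoted tables concrete,
here is a rough Python sketch. It is not the checker behind the figures
above. The pattern is a simplified reading of the RFC 4646 langtag syntax
(grandfathered tags are ignored), and SAMPLE_LANGUAGES and SAMPLE_REGIONS
are tiny hypothetical stand-ins for the IANA Language Subtag Registry.

```python
import re

# Simplified well-formedness pattern after the RFC 4646 langtag grammar.
# Assumption: grandfathered tags are not handled in this sketch.
LANGTAG = re.compile(
    r"""(?:
        (?: [A-Za-z]{2,3} (?:-[A-Za-z]{3}){0,3}  # language plus optional extlang subtags
          | [A-Za-z]{4}                          # reserved 4-letter language
          | [A-Za-z]{5,8} )                      # registered 5-8 letter language
        (?:-[A-Za-z]{4})?                        # script
        (?:-(?:[A-Za-z]{2}|[0-9]{3}))?           # region
        (?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*  # variants
        (?:-[0-9A-WY-Za-wy-z](?:-[A-Za-z0-9]{2,8})+)*   # extensions
        (?:-[xX](?:-[A-Za-z0-9]{1,8})+)?         # trailing private use
      | [xX](?:-[A-Za-z0-9]{1,8})+               # private-use-only tag
    )""",
    re.VERBOSE,
)

# Hypothetical registry excerpts, for illustration only.
SAMPLE_LANGUAGES = {"de", "en", "es", "fr", "no", "pt", "sr", "zh"}
SAMPLE_REGIONS = {"br", "es", "us"}


def classify(tag):
    """Return 'ill-formed', 'well-formed but invalid', or 'valid'."""
    if not LANGTAG.fullmatch(tag):
        return "ill-formed"
    subtags = tag.lower().split("-")
    if subtags[0] == "x":
        return "valid"  # private-use tags need no registry lookup
    if subtags[0] not in SAMPLE_LANGUAGES:
        return "well-formed but invalid"
    # Over-simplification: every later subtag is looked up as a region.
    if any(sub not in SAMPLE_REGIONS for sub in subtags[1:]):
        return "well-formed but invalid"
    return "valid"


if __name__ == "__main__":
    for tag in ["en-us.", "utf-8", "q=0.5", "pdf", "en-securid", "zh-chs", "en-us"]:
        print(repr(tag), "->", classify(tag))
```

Under these assumptions "utf-8" fails on syntax alone (the lone "8" is not
a legal subtag), whereas "pdf" has the shape of a three-letter language
code and is rejected only by the registry lookup.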


---Steve Deach
    sdeach@adobe.com 

Received on Sunday, 24 September 2006 14:55:07 UTC