- From: Stephen Deach <sdeach@adobe.com>
- Date: Sun, 24 Sep 2006 07:54:45 -0700
- To: Martin Duerst <duerst@it.aoyama.ac.jp>, "Mark Davis" <mark.davis@icu-project.org>, "Stephen Deach" <sdeach@adobe.com>
- Cc: "Misha Wolf" <Misha.Wolf@reuters.com>, "Richard Ishida" <ishida@w3.org>, www-international@w3.org, ltru@ietf.org
The fact that "utf-8" and "pdf" (which I would call "encodings") show up in this list points out something interesting about the term "language" that should probably be emphasized in the paper/article.

At 2006.09.24-11:56(+0900), Martin Duerst wrote:
>Hello Mark,
>
>Many thanks for this interesting data. In my mailer, the line endings
>didn't work very well, so I reformatted it (at the same time, my
>mailer messed up the non-ASCII stuff, sorry).
>
>The only three-letter code I can see is pdf, for which we can blame
>Steve's company :-).
>
>On the other hand, in the list, there are a few items (such as en-us,
>pt-br) that look perfectly fine. What was wrong with them?
>
>Regards, Martin.
>
>At 01:31 06/09/24, Mark Davis wrote:
> >I can appreciate the goal. In the case of language tags, we've done
> >some analysis here at Google, and at least in a (large) sample of web
> >pages and xml documents, the three-letter codes don't account for many
> >of the problem cases.
> >
> >Total-Valid 99.62%
> >Total-WellFormed 99.71%
> >
> >Here are some examples of what we do find.
> >
> >Ill-formed:
> >(the second one has a space at the end. this also excludes x-....
> >where the ... is a subtag longer than 8 -- that has a pretty high
> >frequency)
> >
> >Rank Frequency tag
> >
> >102 0.015999% en-us.
> >122 0.010068% en-us
> >219 0.001668% es-es-ts
> >302 0.000638% q=0.5
> >304 0.000634% undefined
> >339 0.000429% espa$B~A%9(Bol
> >391 0.000325% Indonesian
> >458 0.000185% utf-8
> >464 0.000178% pt-br
> >467 0.000173% t$B~A%9(Brk$B~A%9(Be
> >481 0.000158% portugues
> >503 0.000138% de
> >518 0.000126% es-ES-TS
> >529 0.000120% Vietnamese
> >547 0.000107% sr-sp-latn
> >549 0.000107% e
> >555 0.000102% Language T20029 2005-05-18
> >...
> >
> >Well-Formed but Invalid
> >88 0.024632% en-securid
> >133 0.007796% English
> >136 0.007353% xl
> >160 0.004739% Chinese
> >176 0.003235% zs
> >182 0.003062% us
> >183 0.003054% chinese
> >184 0.002891% eses
> >188 0.002497% in
> >189 0.002461% pdf
> >210 0.001827% en-sp
> >213 0.001771% es-sp
> >248 0.001150% zh-chs
> >254 0.001088% French
> >262 0.001019% po
> >276 0.000873% sr-SP
> >279 0.000865% no-bok
> >284 0.000803% Arabic
> >293 0.000733% sr_SP
> >299 0.000667% en-en
> >303 0.000637% ua
> >318 0.000570% jp
> >
>
>#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
>#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

---Steve Deach
sdeach@adobe.com
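
For readers trying to see why tags such as "utf-8" land in the ill-formed list while "pdf" and "English" count as well-formed but invalid, here is a minimal Python sketch of a syntax-only check. It assumes the RFC 4646 langtag grammar, is not the analysis used for the quoted figures, and the name is_well_formed is hypothetical. It omits grandfathered tags and does not normalize underscores, so it would reject "sr_SP". Validity is a separate step that needs a lookup of each subtag in the IANA Language Subtag Registry, which this sketch does not attempt.

```python
import re

# Assumption: simplified RFC 4646 "langtag" syntax; grandfathered tags omitted.
_LANGTAG = re.compile(
    r"""
    (?:
        # language: 2-3 letters plus up to three extlangs, or 4, or 5-8 letters
        (?:[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4}|[a-z]{5,8})
        (?:-[a-z]{4})?                             # script
        (?:-(?:[a-z]{2}|[0-9]{3}))?                # region
        (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*   # variants
        (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*       # extensions
        (?:-x(?:-[a-z0-9]{1,8})+)?                 # trailing private use
      |
        x(?:-[a-z0-9]{1,8})+                       # private-use-only tag
    )
    """,
    re.VERBOSE | re.IGNORECASE,
)


def is_well_formed(tag: str) -> bool:
    """Return True if the whole string matches the simplified syntax.

    This checks shape only: "pdf" passes because it looks like a
    three-letter language subtag, even though no such subtag is registered.
    """
    return _LANGTAG.fullmatch(tag) is not None


if __name__ == "__main__":
    for sample in ["en-us", "en-us.", "utf-8", "sr-sp-latn", "pdf", "English"]:
        print(repr(sample), is_well_formed(sample))
```

Under this sketch "utf-8" fails because the single-character subtag "8" would have to introduce an extension and nothing follows it, and "sr-sp-latn" fails because a script subtag cannot follow a region.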
Received on Sunday, 24 September 2006 14:55:07 UTC