- From: Stephen Deach <sdeach@adobe.com>
- Date: Sun, 24 Sep 2006 07:54:45 -0700
- To: Martin Duerst <duerst@it.aoyama.ac.jp>, "Mark Davis" <mark.davis@icu-project.org>, "Stephen Deach" <sdeach@adobe.com>
- Cc: "Misha Wolf" <Misha.Wolf@reuters.com>, "Richard Ishida" <ishida@w3.org>, www-international@w3.org, ltru@ietf.org
The fact that "utf-8" and "pdf" (which I would call "encodings") show up in
this list points out something interesting about the term "language" that
should probably be emphasized in the paper/article.
At 2006.09.24-11:56(+0900), Martin Duerst wrote:
>Hello Mark,
>
>Many thanks for this interesting data. In my mailer, the line endings
>didn't work very well, so I reformatted it (at the same time, my
>mailer messed up the non-ASCII stuff, sorry).
>The only three-letter code I can see is pdf, for which we can blame
>Steve's company :-).
>
>On the other hand, in the list, there are a few items (such as en-us,
>pt-br) that look perfectly fine. What was wrong with them?
>
>Regards, Martin.
>
>At 01:31 06/09/24, Mark Davis wrote:
> >I can appreciate the goal. In the case of language tags, we've done
> >some analysis here at Google, and at least in a (large) sample of web
> >pages and xml documents, the three-letter codes don't account for many
> >of the problem cases.
> >
> >Total-Valid 99.62%
> >Total-WellFormed 99.71%
> >
> >Here are some examples of what we do find.
> >
> >Ill-formed:
> >(The second one has a space at the end. This also excludes x-....
> >where the ... is a subtag longer than 8 characters -- that has a
> >pretty high frequency.)
>
> >Rank Frequency tag
>
> >102 0.015999% en-us.
> >122 0.010068% en-us
> >219 0.001668% es-es-ts
> >302 0.000638% q=0.5
> >304 0.000634% undefined
> >339 0.000429% español
> >391 0.000325% Indonesian
> >458 0.000185% utf-8
> >464 0.000178% pt-br
> >467 0.000173% türkçe
> >481 0.000158% portugues
> >503 0.000138% de
> >518 0.000126% es-ES-TS
> >529 0.000120% Vietnamese
> >547 0.000107% sr-sp-latn
> >549 0.000107% e
> >555 0.000102% Language T20029 2005-05-18
> >...
> >
> >Well-Formed but Invalid
> >88 0.024632% en-securid
> >133 0.007796% English
> >136 0.007353% xl
> >160 0.004739% Chinese
> >176 0.003235% zs
> >182 0.003062% us
> >183 0.003054% chinese
> >184 0.002891% eses
> >188 0.002497% in
> >189 0.002461% pdf
> >210 0.001827% en-sp
> >213 0.001771% es-sp
> >248 0.001150% zh-chs
> >254 0.001088% French
> >262 0.001019% po
> >276 0.000873% sr-SP
> >279 0.000865% no-bok
> >284 0.000803% Arabic
> >293 0.000733% sr_SP
> >299 0.000667% en-en
> >303 0.000637% ua
> >318 0.000570% jp
>
>
>
>
>#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
>#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
---Steve Deach
sdeach@adobe.com
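
For readers unfamiliar with the two categories in the quoted lists above, here is a rough sketch of how "ill-formed" differs from "well-formed but invalid". It is not the analysis Mark ran. The regular expression is a simplified reading of the RFC 4646 langtag grammar (grandfathered and private-use-only tags are left out), and the KNOWN_* sets are a tiny hypothetical stand-in for the IANA subtag registry, chosen only for illustration. The names LANGTAG and classify are made up for this example. No normalization is attempted, so an underscore form such as sr_SP would come out differently here than in the listing.

```python
import re

# Simplified RFC 4646 langtag syntax (assumption: grandfathered tags and
# private-use-only tags such as "x-foo" are deliberately not handled).
LANGTAG = re.compile(r"""
    (?P<primary>[a-z]{2,8})                      # primary language subtag
    (?P<extlang>(?:-[a-z]{3}){0,3})              # up to three extended language subtags
    (?:-(?P<script>[a-z]{4}))?                   # optional script, e.g. Latn
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?          # optional region, e.g. US or 419
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)
    (?:-[a-wy-z0-9](?:-[a-z0-9]{2,8})+)*         # extensions: singleton plus subtags
    (?:-x(?:-[a-z0-9]{1,8})+)?                   # trailing private-use sequence
""", re.VERBOSE | re.IGNORECASE)

# Hypothetical mini-registry for illustration only; the real registry is far larger.
KNOWN_LANGUAGES = {"en", "es", "pt", "de", "fr", "sr", "zh", "no", "ja", "uk"}
KNOWN_SCRIPTS = {"Latn", "Cyrl", "Hans", "Hant"}
KNOWN_REGIONS = {"US", "GB", "ES", "BR", "DE", "FR", "CN", "JP", "UA"}


def classify(tag):
    """Return 'ill-formed', 'well-formed but invalid', or 'valid'."""
    match = LANGTAG.fullmatch(tag)
    if match is None:
        return "ill-formed"          # fails the syntax outright
    invalid = "well-formed but invalid"
    if match.group("primary").lower() not in KNOWN_LANGUAGES:
        return invalid
    # The toy registry contains no extlang or variant subtags at all.
    if match.group("extlang") or match.group("variants"):
        return invalid
    script = match.group("script")
    if script and script.title() not in KNOWN_SCRIPTS:
        return invalid
    region = match.group("region")
    if region and region.upper() not in KNOWN_REGIONS:
        return invalid
    return "valid"


for sample in ["en-US", "en-us ", "en-us.", "utf-8", "es-es-ts",
               "English", "pdf", "en-securid", "no-bok", "jp"]:
    print(repr(sample), "->", classify(sample))
```

Under these assumptions a bare pt-br or en-us classifies as valid, while en-us followed by a space or a period is ill-formed. That is in line with Mark's note about the trailing space; what was wrong with the other entries that look fine cannot be told from the listing alone.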
Received on Sunday, 24 September 2006 14:55:07 UTC