- From: Erik van der Poel <erikv@google.com>
- Date: Mon, 25 Aug 2008 17:32:24 -0700
- To: "Martin Duerst" <duerst@it.aoyama.ac.jp>
- Cc: "Henri Sivonen" <hsivonen@iki.fi>, "Richard Ishida" <ishida@w3.org>, "Ian Hickson" <ian@hixie.ch>, "HTML WG" <public-html@w3.org>, www-international@w3.org
On Thu, Aug 21, 2008 at 7:16 PM, Martin Duerst <duerst@it.aoyama.ac.jp> wrote:
> At 16:36 08/08/15, Henri Sivonen wrote:
>>Of course, if the data is *wrong* significantly more often than
>>lang='' (assuming that the correctness level of lang='' establishes an
>>implicit data quality baseline), it would be good to ignore it. My
>>guess is that HTTP-level Content-Language is more likely to be wrong
>>(it sure is less obvious to diagnose) than any HTML-level declaration.
>>(Due to Ruby's Postulate:
>>http://intertwingly.net/slides/2004/devcon/68.html )
>
> I guess Google might be able to come up with some data.
> I have copied Erik van der Poel, an expert in this area.
>
> My guess is that:
> - Authors who declare something usually use lang/xml:lang,
>   and meta maybe as an addition.
> - Some tools may use meta, but the chance that the author
>   corrects this if necessary is low (this is different from
>   the charset case, because the charset case is very
>   visible/actionable).

From 2001 to 2007, <html lang="..."> usage increased from 2% to 15% of
HTML documents in Google's index, while <html xml:lang="..."> usage
increased from 0.4% to 9% in the same period. On the other hand,
<meta http-equiv=Content-Language content=...> usage increased from 5%
to 8%, while HTTP Content-Language increased from 1% to 6%.

I don't know how many of the declared languages are "wrong", but I can
compare them with our language detector's result, for the languages
that are supported by our detector. For <html lang="...">, 13.0% were
different. For the meta Content-Language, 11.4% were different, while
for HTTP Content-Language, 11.0% were different. (These numbers are
quite similar, so I don't know whether we can speak of a Ruby effect.)

Many of the differences for <html lang="..."> were for documents that
had lang="en" while our detector returned a different result. Perhaps
"en" is the default value, and is not being modified by authors/admins.

Erik
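
The comparison described above (declared language versus detector result) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipeline used on Google's index: the names LangDeclarationParser, primary_subtag and mismatch_rate are hypothetical, the detector output is assumed to be a bare primary language code such as "en", and only the first entry of a comma-separated Content-Language value is considered.

```python
from html.parser import HTMLParser


class LangDeclarationParser(HTMLParser):
    """Collects the in-document language declarations from one HTML page."""

    def __init__(self):
        super().__init__()
        self.html_lang = None      # <html lang="...">
        self.html_xml_lang = None  # <html xml:lang="...">
        self.meta_lang = None      # <meta http-equiv=Content-Language content=...>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values keep their case.
        attr = {name: (value or "") for name, value in attrs}
        if tag == "html":
            self.html_lang = attr.get("lang", self.html_lang)
            self.html_xml_lang = attr.get("xml:lang", self.html_xml_lang)
        elif tag == "meta" and attr.get("http-equiv", "").lower() == "content-language":
            self.meta_lang = attr.get("content", self.meta_lang)


def primary_subtag(declared):
    """Reduce a declaration such as 'en-US' or 'en, fr' to 'en' (assumed normalization)."""
    first = declared.split(",")[0]
    return first.strip().lower().split("-")[0]


def mismatch_rate(pairs):
    """Fraction of (declared, detected) pairs that differ.

    Pairs with no declaration (None or '') are skipped, since there is
    nothing to compare against the detector.
    """
    compared = [(primary_subtag(d), det.lower()) for d, det in pairs if d]
    if not compared:
        return 0.0
    differing = sum(1 for declared, detected in compared if declared != detected)
    return differing / len(compared)
```

A document declaring lang="en" for which the detector returns another language counts as "different" here, which is the case that dominated the <html lang="..."> differences; the sketch says nothing about which of the two is actually wrong.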
Received on Tuesday, 26 August 2008 00:33:10 UTC