Re: meta content-language from Erik van der Poel on 2008-08-26 (www-international@w3.org from July to September 2008)

From: Erik van der Poel <erikv@google.com>
Date: Mon, 25 Aug 2008 17:32:24 -0700
To: "Martin Duerst" <duerst@it.aoyama.ac.jp>
Cc: "Henri Sivonen" <hsivonen@iki.fi>, "Richard Ishida" <ishida@w3.org>, "Ian Hickson" <ian@hixie.ch>, "HTML WG" <public-html@w3.org>, www-international@w3.org
Message-ID: <c07a32650808251732i636efac2oab8c13a5b4a37cc8@mail.gmail.com>

On Thu, Aug 21, 2008 at 7:16 PM, Martin Duerst <duerst@it.aoyama.ac.jp> wrote:
> At 16:36 08/08/15, Henri Sivonen wrote:
>>Of course, if the data is *wrong* significantly more often than
>>lang='' (assuming that the correctness level of lang='' establishes an
>>implicit data quality baseline), it would be good to ignore it. My
>>guess is that HTTP-level Content-Language is more likely to be wrong
>>(it sure is less obvious to diagnose) than any HTML-level declaration.
>>(Due to Ruby's Postulate:
>>http://intertwingly.net/slides/2004/devcon/68.html )
>
> I guess Google might be able to come up with some data.
> I have copied Erik van der Poel, an expert in this area.
>
> My guess is that:
> - Authors who declare something usually use lang/xml:lang,
>  and meta maybe as an addition.
> - Some tools may use meta, but the chance that the author
>  corrects this if necessary is low (this is different from
>  the charset case, because the charset case is very
>  visible/actionable).

From 2001 to 2007, <html lang="..."> usage increased from 2% to 15% of
HTML documents in Google's index, while <html xml:lang="..."> usage
increased from 0.4% to 9% in the same period.

On the other hand, <meta http-equiv=Content-Language content=...>
usage increased from 5% to 8%, while HTTP Content-Language increased
from 1% to 6%.

I don't know how many of the declared languages are "wrong", but I can
compare them with our language detector's result, for the languages
that our detector supports. For <html lang="...">, 13.0% of the
declarations differed from the detector's result. For the meta
Content-Language, 11.4% differed, and for HTTP Content-Language, 11.0%
differed. (These numbers are quite similar, so I don't know whether we
can speak of a Ruby effect.)
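
To make the comparison above concrete, here is a rough sketch of how
such per-source mismatch rates could be computed. It is not the code
behind the numbers in this message. The names (LangDeclParser,
primary_subtag, declared_languages, mismatch_rates) are made up for
illustration, the detector's result is simply passed in as a string,
and comparing only the primary subtag ("en" for "en-US") is my
assumption about what counts as "different".

```python
from html.parser import HTMLParser


class LangDeclParser(HTMLParser):
    """Collects language declarations from <html> and <meta> start tags."""

    def __init__(self):
        super().__init__()
        self.declared = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "html":
            for name in ("lang", "xml:lang"):
                if attrs.get(name):
                    self.declared["html " + name] = attrs[name]
        elif tag == "meta":
            equiv = (attrs.get("http-equiv") or "").lower()
            if equiv == "content-language" and attrs.get("content"):
                self.declared["meta"] = attrs["content"]


def primary_subtag(value):
    """Reduce a declaration such as 'en-US, fr' to 'en' (assumption:
    only the first listed tag and its primary subtag are compared)."""
    first = value.split(",")[0].strip()
    return first.replace("_", "-").split("-")[0].lower()


def declared_languages(html, http_header=None):
    """Map each declaring source to the primary subtag it declares."""
    parser = LangDeclParser()
    parser.feed(html)
    parser.close()
    raw = dict(parser.declared)
    if http_header:
        raw["http"] = http_header
    result = {}
    for source, value in raw.items():
        subtag = primary_subtag(value)
        if subtag:  # skip empty or whitespace-only declarations
            result[source] = subtag
    return result


def mismatch_rates(docs):
    """docs: iterable of (html, http_header_or_None, detected_language).
    Returns, per source, the fraction of declaring documents whose
    declaration differs from the detector's result."""
    declaring, differing = {}, {}
    for html, http_header, detected in docs:
        for source, lang in declared_languages(html, http_header).items():
            declaring[source] = declaring.get(source, 0) + 1
            if lang != detected.lower():
                differing[source] = differing.get(source, 0) + 1
    return {s: differing.get(s, 0) / n for s, n in declaring.items()}
```

Each rate is taken over the documents that carry that particular
declaration, not over all documents, so the three sources can be
compared even though their usage levels differ.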

Many of the differences for <html lang="..."> were for documents that
had lang="en" while our detector returned a different result. Perhaps
"en" is the default value, and is not being modified by
authors/admins.

Erik

Received on Tuesday, 26 August 2008 00:33:10 UTC