Re: meta content-language from Erik van der Poel on 2008-08-27 (public-html@w3.org from August 2008)

From: Erik van der Poel <erikv@google.com>
Date: Tue, 26 Aug 2008 23:10:57 -0700
To: "Martin Duerst" <duerst@it.aoyama.ac.jp>
Cc: "Henri Sivonen" <hsivonen@iki.fi>, "Richard Ishida" <ishida@w3.org>, "Ian Hickson" <ian@hixie.ch>, "HTML WG" <public-html@w3.org>, www-international@w3.org
Message-ID: <c07a32650808262310y76e76665id04f9e427e9396db@mail.gmail.com>

By the way, just to give you some indication of the types of mistakes
that we find on the net, the most common meta language list that
contains at least one comma is "de,at,ch". The "de" is of course a
real language code (German), but "at" is a country code (Austria), and
"ch" is a language code (Chamorro) that was probably intended as a
country code (Switzerland). If you want to check such pages for
yourself, here are three examples:

http://www.peterzahlt.de/
http://www.kostenlose-girokonten.com/
http://www.poker-lernen.info/

Maybe a few influential sites started with this mistake and all of the
others simply copied it? Here are the rest of the top 10:

de,at
th,en
pt,pt-pt
fr,fr-be,fr-ca,fr-lu,fr-ch
en,tr
de,en
es,es-es
fr,en
fr,be,ch,qc,en,lu

Maybe "qc" was supposed to be Quebec? (I don't think it's a valid
country/region tag.)
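
For anyone who wants to run this kind of check on their own pages, here
is a rough Python sketch of classifying each entry in a comma-separated
Content-Language value. The two code tables are small hand-picked
subsets for illustration only, and the names classify and check_list
are made up for this example; a real checker would load the full IANA
Language Subtag Registry. Only the primary subtag is examined, so
"fr-ch" is judged on "fr" alone.

```python
# Illustrative subsets only (an assumption of this sketch, not a complete list).
# ISO 639-1 language codes that appear in the lists above.
LANGUAGE_CODES = {"de", "en", "es", "fr", "pt", "th", "tr", "ch", "be", "lu"}
# ISO 3166-1 alpha-2 country codes that appear in the lists above.
COUNTRY_CODES = {"at", "be", "ca", "ch", "de", "es", "fr", "lu", "pt", "th", "tr"}


def classify(tag):
    """Classify one entry by its primary subtag (the part before any '-')."""
    primary = tag.strip().lower().partition("-")[0]
    if primary in LANGUAGE_CODES:
        return "language"
    if primary in COUNTRY_CODES:
        return "country, not a language"
    return "unknown"


def check_list(value):
    """Split a Content-Language value on commas and classify each entry."""
    entries = [t.strip() for t in value.split(",") if t.strip()]
    return [(t, classify(t)) for t in entries]


if __name__ == "__main__":
    for value in ("de,at,ch", "fr,be,ch,qc,en,lu"):
        print(value)
        for tag, verdict in check_list(value):
            print("  %-6s %s" % (tag, verdict))
```

Note that a purely syntactic check like this accepts "ch", "be" and
"lu", since they are valid language codes (Chamorro, Belarusian and
Luba-Katanga), even though the authors most likely meant Switzerland,
Belgium and Luxembourg. Only "at" and "qc" get flagged.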

Erik

On Mon, Aug 25, 2008 at 5:32 PM, Erik van der Poel <erikv@google.com> wrote:
> On Thu, Aug 21, 2008 at 7:16 PM, Martin Duerst <duerst@it.aoyama.ac.jp> wrote:
>> At 16:36 08/08/15, Henri Sivonen wrote:
>>>Of course, if the data is *wrong* significantly more often than
>>>lang='' (assuming that the correctness level of lang='' establishes an
>>>implicit data quality baseline), it would be good to ignore it. My
>>>guess is that HTTP-level Content-Language is more likely to be wrong
>>>(it sure is less obvious to diagnose) than any HTML-level declaration.
>>>(Due to Ruby's Postulate:
>>>http://intertwingly.net/slides/2004/devcon/68.html )
>>
>> I guess Google might be able to come up with some data.
>> I have copied Erik van der Poel, an expert in this area.
>>
>> My guess is that:
>> - Authors who declare something usually use lang/xml:lang,
>>  and meta maybe as an addition.
>> - Some tools may use meta, but the chance that the author
>>  corrects this if necessary is low (this is different from
>>  the charset case, because the charset case is very
>>  visible/actionable).
>
> From 2001 to 2007, <html lang="..."> usage increased from 2% to 15% of
> HTML documents in Google's index, while <html xml:lang="..."> usage
> increased from 0.4% to 9% in the same period.
>
> On the other hand, <meta http-equiv=Content-Language content=...>
> usage increased from 5% to 8%, while HTTP Content-Language increased
> from 1% to 6%.
>
> I don't know how many of the declared languages are "wrong", but I can
> compare them with our language detector's result, for the languages
> that are supported by our detector. For <html lang="...">, 13.0% were
> different. For the meta Content-Language, 11.4% were different, while
> for HTTP Content-Language, 11.0% were different. (These numbers are
> quite similar, so I don't know whether we can speak of a Ruby effect.)
>
> Many of the differences for <html lang="..."> were for documents that
> had lang="en" while our detector returned a different result. Perhaps
> "en" is the default value, and is not being modified by
> authors/admins.
>
> Erik
>

Received on Wednesday, 27 August 2008 06:11:44 UTC