language declaration stats (was Re: meta content-language)

Hi Erik,

Thanks for the useful update.  I think it's quite significant that use of
lang rose by 13 percentage points, versus 3 percentage points for meta, in
the same period.

Any way to tell what percentage of pages use both lang and xml:lang at the
same time for each of these figures?  That would then also give us a total
figure for use of attributes.

Also, what percentage of pages that use the attributes use meta
Content-Language as well?
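
To make the two questions above concrete, here is a rough sketch of the
kind of tally I have in mind.  It is only an illustration: the class and
function names are made up, it uses Python's standard html.parser rather
than whatever Google actually runs, and it assumes that an empty value
(lang="") does not count as a declaration.

```python
from collections import Counter
from html.parser import HTMLParser


class LangDeclarationScanner(HTMLParser):
    """Hypothetical helper: records which language declarations a page has."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; valueless
        # attributes arrive as None, so normalise those to "".
        attr_map = {name: (value or "") for name, value in attrs}
        if tag == "html":
            # Assumption: an empty lang="" is not treated as a declaration.
            if attr_map.get("lang", "").strip():
                self.found.add("lang")
            if attr_map.get("xml:lang", "").strip():
                self.found.add("xml:lang")
        elif tag == "meta":
            is_content_language = (
                attr_map.get("http-equiv", "").strip().lower()
                == "content-language"
            )
            if is_content_language and attr_map.get("content", "").strip():
                self.found.add("meta")


def declarations(html_text):
    """Return the set of declaration kinds found in one document."""
    scanner = LangDeclarationScanner()
    scanner.feed(html_text)
    scanner.close()
    return frozenset(scanner.found)


def tally(pages):
    """Count pages per declaration kind, including the overlaps asked about."""
    counts = Counter()
    for page in pages:
        found = declarations(page)
        has_attribute = bool(found & {"lang", "xml:lang"})
        counts["total"] += 1
        counts["lang"] += "lang" in found
        counts["xml:lang"] += "xml:lang" in found
        counts["both_attributes"] += {"lang", "xml:lang"} <= found
        counts["any_attribute"] += has_attribute
        counts["meta"] += "meta" in found
        counts["attribute_and_meta"] += has_attribute and "meta" in found
    return counts
```

The point of counting "both_attributes" is that the total for attributes
is then lang + xml:lang - both, rather than the simple sum of the two
figures.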

Cheers,
RI


============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)

http://www.w3.org/International/
http://rishida.net/



> -----Original Message-----
> From: Erik van der Poel [mailto:erikv@google.com]
> Sent: 26 August 2008 01:32
> To: Martin Duerst
> Cc: Henri Sivonen; Richard Ishida; Ian Hickson; HTML WG;
> www-international@w3.org
> Subject: Re: meta content-language
> 
> On Thu, Aug 21, 2008 at 7:16 PM, Martin Duerst <duerst@it.aoyama.ac.jp> wrote:
> > At 16:36 08/08/15, Henri Sivonen wrote:
> >>Of course, if the data is *wrong* significantly more often than
> >>lang='' (assuming that the correctness level of lang='' establishes an
> >>implicit data quality baseline), it would be good to ignore it. My
> >>guess is that HTTP-level Content-Language is more likely to be wrong
> >>(it sure is less obvious to diagnose) than any HTML-level declaration.
> >>(Due to Ruby's Postulate:
> >>http://intertwingly.net/slides/2004/devcon/68.html )
> >
> > I guess Google might be able to come up with some data.
> > I have copied Erik van der Poel, an expert in this area.
> >
> > My guess is that:
> > - Authors who declare something usually use lang/xml:lang,
> >  and meta maybe as an addition.
> > - Some tools may use meta, but the chance that the author
> >  corrects this if necessary is low (this is different from
> >  the charset case, because the charset case is very
> >  visible/actionable).
> 
> From 2001 to 2007, <html lang="..."> usage increased from 2% to 15% of
> HTML documents in Google's index, while <html xml:lang="..."> usage
> increased from 0.4% to 9% in the same period.
> 
> On the other hand, <meta http-equiv=Content-Language content=...>
> usage increased from 5% to 8%, while HTTP Content-Language increased
> from 1% to 6%.
> 
> I don't know how many of the declared languages are "wrong", but I can
> compare them with our language detector's result, for the languages
> that are supported by our detector. For <html lang="...">, 13.0% were
> different. For the meta Content-Language, 11.4% were different, while
> for HTTP Content-Language, 11.0% were different. (These numbers are
> quite similar, so I don't know whether we can speak of a Ruby effect.)
> 
> Many of the differences for <html lang="..."> were for documents that
> had lang="en" while our detector returned a different result. Perhaps
> "en" is the default value, and is not being modified by
> authors/admins.
> 
> Erik

Received on Wednesday, 27 August 2008 09:53:03 UTC