
Re: HTML5 Issue 11 (encoding detection): I18N WG response...

From: Maciej Stachowiak <mjs@apple.com>
Date: Thu, 20 Aug 2009 00:25:14 -0700
Cc: "Phillips, Addison" <addison@amazon.com>, "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-id: <882ACA1B-3189-4FDE-A539-7AA71BD3307D@apple.com>
To: Henri Sivonen <hsivonen@iki.fi>

On Aug 20, 2009, at 12:14 AM, Henri Sivonen wrote:

> On Aug 20, 2009, at 10:06, Phillips, Addison wrote:
>
>> I think the world has changed significantly. In the past, setting a  
>> default of UTF-8 in your browser produced mainly bad results. But,  
>> at least according to some measures [1], UTF-8 is rapidly becoming  
>> the most reasonable default encoding on the Web.
> [...]
>> [1] http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html
>
> This shows an uptake in UTF-8, but it proves nothing without data on  
> how much is labeled and how much unlabeled. Uptake in labeled UTF-8  
> is awesome but doesn't affect what makes sense as the default  
> processing for unlabeled data.

This is the key point. The relevant statistic for the default encoding
is the predominant encoding for unlabeled content. We could do a new
study on this, but to the best of my knowledge, UTF-8 is rare among
unlabeled content.

Also, it's been mentioned that UTF-8 can be heuristically detected
without too much effort. If that's the case, then it does not make
much sense to make it the fallback after algorithmic detection has
failed: content that reaches the fallback has already failed that
detection, so it is unlikely to be UTF-8.
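
To illustrate the kind of heuristic meant here, below is a minimal
sketch in Python. It is not taken from any browser or from the HTML 5
draft. The function names and the windows-1252 legacy default are
assumptions chosen for illustration. The idea is that non-ASCII bytes
in legacy encodings rarely happen to form valid UTF-8 sequences, so a
strict decode that succeeds on non-ASCII input is strong evidence of
UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if data decodes as strict UTF-8 and is not pure ASCII."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # Pure ASCII is valid UTF-8 but gives no evidence either way.
    return any(byte >= 0x80 for byte in data)


def pick_encoding(data: bytes, legacy_default: str = "windows-1252") -> str:
    """Choose an encoding for unlabeled content (hypothetical helper).

    legacy_default is an assumption; a real user agent would pick it
    per locale or user setting.
    """
    return "utf-8" if looks_like_utf8(data) else legacy_default
```

With a check like this in front, anything that falls through to the
default has already been shown not to look like UTF-8.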

That being said, I agree that the uptake of UTF-8 is awesome, and I  
think everyone would like to see public Web content move to UTF-8 as  
much as possible. The only question is how to do this in light of  
legacy constraints.

  - Maciej

>
>> At the same time, I think UTF-8 is more than a politically correct  
>> fig leaf. The more standards and implementations stress good  
>> choices, the more likely people (users, content authors) are to  
>> take them seriously. If you happen to have chosen UTF-8 as an  
>> encoding, your pages are more likely to just work. Recommending  
>> UTF-8 as a default probably will continue to establish itself as  
>> the right choice as time progresses. Remember: this is the "all  
>> else fails" result and is exposed to user intervention by nearly  
>> all user agents.
>
> HTML 5 already recommends (labeled) UTF-8 as the default for  
> authoring tools.
>
> -- 
> Henri Sivonen
> hsivonen@iki.fi
> http://hsivonen.iki.fi/
>
>
>
Received on Thursday, 20 August 2009 07:25:57 GMT
