Re: Sources for Encoding specification from Anne van Kesteren on 2012-04-18 (public-i18n-core@w3.org from April to June 2012)

From: Anne van Kesteren <annevk@opera.com>
Date: Wed, 18 Apr 2012 09:09:30 +0200
To: "Norbert Lindenberg" <w3@norbertlindenberg.com>
Cc: public-i18n-core@w3.org
Message-ID: <op.wcx8t4me64w2qv@annevk-macbookpro.local>

On Wed, 18 Apr 2012 08:15:17 +0200, Norbert Lindenberg  
<w3@norbertlindenberg.com> wrote:
> A spec on encoding handling for the web should probably focus on those  
> encodings that are most commonly used on the web. Mark Davis sometimes  
> publishes data in that area; he may be able to provide more detail.
> http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html
> What browsers currently support may be influenced by which libraries  
> they use, and the libraries may have accumulated encodings that aren't  
> relevant to the web.

Yeah, if we can do more research that would be great. I think most  
browsers indeed just use libraries, but Opera and Chrome are a bit more  
restrictive. I can't say much about Opera, but Chrome has a modified  
version of ICU with support for many encodings disabled, as well as a  
custom implementation for euc-jp and a few other tweaks. Gecko has  
similarly been changing its encoding support over the years, in  
particular which extensions of various encodings it implements (and  
more recently it disabled utf-7 and utf-32 support).

>> Is there any utf-8 specification that says otherwise? You get U+FFFD,  
>> but the sequences are definitely supported.
>
> The UTF-8 specification (in the Unicode Standard, in ISO 10646, in RFC  
> 3629) was updated years ago to only allow sequences up to four bytes.  
> But I suppose it doesn't really matter whether a sequence of five or six  
> bytes is allowed and maps to U+FFFD because it's above U+10FFFF, or it's  
> treated as an error directly and replaced with U+FFFD...

My apologies; for some reason I thought both Unicode and  
http://tools.ietf.org/html/rfc3629 still defined how to handle five-  
and six-byte sequences (even though they are invalid). As far as I know  
implementations have not changed with respect to this.
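
To make the point about five- and six-byte sequences concrete, here is a rough sketch using Python's built-in utf-8 decoder. It is only an illustration, not how any browser implements it; decode_lenient is a hypothetical name, and the byte strings are hand-built examples of the old long forms. It shows that U+10FFFF already fits in four bytes and that the longer forms end up as U+FFFD either way.

```python
# Sketch only: uses Python's built-in decoder, not any browser's code.
# decode_lenient is a hypothetical helper name chosen for this example.

def decode_lenient(data: bytes) -> str:
    """Decode as utf-8, replacing ill-formed input with U+FFFD."""
    return data.decode("utf-8", errors="replace")

# The highest code point, U+10FFFF, needs exactly four bytes.
assert "\U0010FFFF".encode("utf-8") == b"\xf4\x8f\xbf\xbf"

# Hand-built examples of the old five- and six-byte forms
# (lead bytes 0xF8 and 0xFC followed by continuation bytes).
five_byte = b"\xf8\x88\x80\x80\x80"
six_byte = b"\xfc\x84\x80\x80\x80\x80"

for sequence in (five_byte, six_byte):
    decoded = decode_lenient(sequence)
    # Every resulting character is U+FFFD. How many of them are
    # produced per ill-formed sequence depends on the decoder.
    assert set(decoded) == {"\ufffd"}

# A strict decode treats the same bytes as an error instead.
try:
    five_byte.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
assert strict_failed
```

Whether a decoder emits one U+FFFD per ill-formed sequence or one per byte differs between implementations, so the sketch only checks that nothing other than U+FFFD comes out.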

-- 
Anne van Kesteren
http://annevankesteren.nl/

Received on Wednesday, 18 April 2012 07:10:12 UTC