Re: Sources for Encoding specification from Anne van Kesteren on 2012-04-11 (public-i18n-core@w3.org from April to June 2012)

From: Anne van Kesteren <annevk@opera.com>
Date: Wed, 11 Apr 2012 09:24:18 +0200
To: "Norbert Lindenberg" <w3@norbertlindenberg.com>
Cc: public-i18n-core@w3.org
Message-ID: <op.wclausyx64w2qv@annevk-macbookpro.local>

Hi all, thanks for your email.

On Tue, 10 Apr 2012 20:52:20 +0200, Norbert Lindenberg  
<w3@norbertlindenberg.com> wrote:
> I'm writing to you on behalf of the W3C Internationalization Core WG  
> [1], which is currently looking at your Encoding specification [2].
>
> We're wondering what sources you used to obtain the information in the  
> specification, such as the list of encodings, their aliases, and the  
> mapping tables for them. Is this derived from looking at the source code  
> of one or more browsers, or from testing their behavior, or from a web  
> index that shows which encodings are most commonly used on the web and  
> how?

Encodings and labels are found by looking at IANA and reverse engineering  
browsers. http://wiki.whatwg.org/wiki/Web_Encodings has some of the early  
data. http://annevankesteren.nl/2010/12/encodings-labels-tested describes  
some of the early research for single-byte encodings.

The indexes are derived primarily by reverse engineering: letting a
browser decode all valid byte sequences for that encoding and then storing
the result. For problematic indexes, such as the one for big5, additional
research has been done on existing content, some of which is still
ongoing. There has also been a ton of awesome feedback from the author of
http://coq.no/character-tables/en who knows a ton about legacy encodings
and their indexes.
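
The reverse-engineering approach described above can be sketched roughly
as follows. This is only an illustration: it uses Python's own "cp1252"
codec as a stand-in for a browser decoder, and the function name and the
choice to key the table by (byte - 0x80) are assumptions made for the
example, not something taken from the specification or from the actual
tooling.

```python
def build_single_byte_index(codec_name):
    """Decode every byte from 0x80 to 0xFF and record the resulting code point.

    Keys are pointers (byte - 0x80), an assumed convention for this sketch.
    Bytes the decoder rejects are stored as None.
    """
    index = {}
    for byte in range(0x80, 0x100):
        try:
            index[byte - 0x80] = ord(bytes([byte]).decode(codec_name))
        except UnicodeDecodeError:
            index[byte - 0x80] = None
    return index


# Python's codec stands in for the browser under test here.
index = build_single_byte_index("cp1252")
print(hex(index[0]))   # byte 0x80 decodes to U+20AC (euro sign)
print(index[1])        # byte 0x81 is unmapped in Python's cp1252, so None
```

A browser may well map bytes that Python leaves undefined, which is the
kind of difference this sort of testing is meant to surface.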

> Some issues I'm wondering about, which could be resolved by looking at  
> good data: Is EUC-TW not used on the web, or so rarely that it's not  
> worth specifying?

From what I recall it is not supported by Chrome and therefore not
considered.

> Do browsers really only encode the characters they decode; don't they  
> ever try to map full-width ASCII to their plain ASCII equivalents, or  
> use other fallbacks, when encoding?

I have not tested encoders much yet. I've been focusing primarily on  
decoders thus far. Testing encoders is on my todo list and I guess I  
should write tests that iterate over all of Unicode and see how it gets  
encoded. I think ideally though they just use the mapping tables. The most  
common case of encoding, HTML forms, will encode the missing code points  
as character references.
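
A minimal sketch of both ideas, assuming Python's codecs as a stand-in
for a browser encoder (real browsers may behave differently, which is
exactly what such a test would check). The helper names are hypothetical.
Python's "xmlcharrefreplace" error handler produces decimal character
references for code points the codec cannot encode, similar in spirit to
what form submission does.

```python
def encode_for_form(text, codec_name):
    """Encode text, replacing unencodable code points with &#N; references."""
    return text.encode(codec_name, errors="xmlcharrefreplace")


def encodable_code_points(codec_name):
    """Iterate over all of Unicode and collect what the codec can encode."""
    result = []
    for code_point in range(0x110000):
        try:
            chr(code_point).encode(codec_name)
        except UnicodeEncodeError:
            continue  # not encodable (includes lone surrogates)
        result.append(code_point)
    return result


# U+00E9 exists in iso-8859-1, U+20AC does not in Python's strict codec.
print(encode_for_form("\u00e9\u20ac", "iso-8859-1"))  # b'\xe9&#8364;'
```

Comparing the list returned by encodable_code_points() with the code
points a decoder produces would show whether an encoder uses any
fallbacks beyond the mapping table.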

> Do they really assume it's safe to encode all windows-1252 characters for
> a form in a page labeled iso-8859-1?

It's required for compatibility with sites. The web platform is kind of  
illogical that way.
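
To illustrate the behaviour in question, here is a small sketch in which
the label "iso-8859-1" is resolved to windows-1252 before encoding. The
label table is a tiny, hypothetical excerpt written for this example and
Python's "cp1252" codec is only an approximation of a browser's encoder.

```python
# Hypothetical excerpt of a label table: web labels -> Python codec names.
WEB_LABELS = {
    "iso-8859-1": "cp1252",
    "latin1": "cp1252",
    "windows-1252": "cp1252",
}


def encode_form_value(text, page_label):
    """Encode a form value using the encoding the page label resolves to."""
    codec_name = WEB_LABELS.get(page_label.strip().lower(), page_label)
    return text.encode(codec_name, errors="xmlcharrefreplace")


# With the label treated as windows-1252, the euro sign is a single byte.
print(encode_form_value("\u20ac", "iso-8859-1"))                  # b'\x80'
# A strict iso-8859-1 encoder would fall back to a character reference.
print("\u20ac".encode("iso-8859-1", errors="xmlcharrefreplace"))  # b'&#8364;'
```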

> Do UTF-8 decoders really still allow for 5-byte and 6-byte sequences?

Is there any utf-8 specification that says otherwise? You get U+FFFD, but  
the sequences are definitely supported.
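
As a rough illustration of the "you get U+FFFD" outcome, the sketch below
feeds a 5-byte and a 6-byte form to Python's UTF-8 decoder with
errors="replace". This is Python's behaviour, not a browser's: how many
U+FFFD characters come out, and whether the whole sequence is consumed as
one unit, can differ between decoders, so the example only checks that
nothing other than U+FFFD is produced. The byte values are chosen for the
example.

```python
# Lead bytes 0xF8 and 0xFC introduce the old 5-byte and 6-byte forms.
five_byte = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])
six_byte = bytes([0xFC, 0x84, 0x80, 0x80, 0x80, 0x80])


def decode_lenient(data):
    """Decode as UTF-8, substituting U+FFFD for anything malformed."""
    return data.decode("utf-8", errors="replace")


for sequence in (five_byte, six_byte):
    decoded = decode_lenient(sequence)
    # Every resulting character is the replacement character.
    print(sequence.hex(), all(ch == "\ufffd" for ch in decoded))
```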

Kind regards,

> [1] http://www.w3.org/International/track/actions/111
> [2] http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html

-- 
Anne van Kesteren
http://annevankesteren.nl/

Received on Wednesday, 11 April 2012 07:25:02 UTC