- From: Anne van Kesteren <annevk@opera.com>
- Date: Wed, 11 Apr 2012 09:24:18 +0200
- To: "Norbert Lindenberg" <w3@norbertlindenberg.com>
- Cc: public-i18n-core@w3.org
Hi all, thanks for your email.

On Tue, 10 Apr 2012 20:52:20 +0200, Norbert Lindenberg <w3@norbertlindenberg.com> wrote:

> I'm writing to you on behalf of the W3C Internationalization Core WG
> [1], which is currently looking at your Encoding specification [2].
>
> We're wondering what sources you used to obtain the information in the
> specification, such as the list of encodings, their aliases, and the
> mapping tables for them. Is this derived from looking at the source code
> of one or more browsers, or from testing their behavior, or from a web
> index that shows which encodings are most commonly used on the web and
> how?

Encodings and labels are found by looking at IANA and reverse engineering browsers. http://wiki.whatwg.org/wiki/Web_Encodings has some of the early data. http://annevankesteren.nl/2010/12/encodings-labels-tested describes some of the early research for single-byte encodings.

The indexes are derived primarily by reverse engineering: letting a browser decode all valid byte sequences for an encoding and then storing the result. For problematic indexes, such as the one for big5, additional research has been done on existing content, some of which is still ongoing. There has also been a ton of awesome feedback from the author of http://coq.no/character-tables/en, who knows a ton about legacy encodings and their indexes.

> Some issues I'm wondering about, which could be resolved by looking at
> good data: Is EUC-TW not used on the web, or so rarely that it's not
> worth specifying?

From what I recall it is not supported by Chrome and therefore not considered.

> Do browsers really only encode the characters they decode; don't they
> ever try to map full-width ASCII to their plain ASCII equivalents, or
> use other fallbacks, when encoding?

I have not tested encoders much yet; I've been focusing primarily on decoders thus far. Testing encoders is on my todo list, and I guess I should write tests that iterate over all of Unicode and see how it gets encoded. I think ideally, though, they just use the mapping tables. The most common case of encoding, HTML forms, will encode missing code points as character references.

> Do they really assume it's safe to encode all windows-1252 characters for
> a form in a page labeled iso-8859-1?

It's required for compatibility with sites. The web platform is kind of illogical that way.

> Do UTF-8 decoders really still allow for 5-byte and 6-byte sequences?

Is there any UTF-8 specification that says otherwise? You get U+FFFD, but the sequences are definitely supported.

Kind regards,

> [1] http://www.w3.org/International/track/actions/111
> [2] http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html

--
Anne van Kesteren
http://annevankesteren.nl/
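As a rough illustration of the index-building approach described above (decode every byte, record the result), here is a minimal Python sketch. It uses Python's built-in windows-1252 codec as a stand-in for a browser decoder; the encoding name and byte range are only illustrative, and Python's table differs slightly from the spec's index (it rejects the five unassigned bytes rather than passing them through).

```python
# Sketch: derive a single-byte index by decoding every high byte and
# recording the resulting code point. Python's codec stands in for a
# browser decoder here; its windows-1252 table is close to, but not
# identical with, the Encoding spec's index.
index = {}
for byte in range(0x80, 0x100):
    try:
        index[byte] = bytes([byte]).decode("windows-1252")
    except UnicodeDecodeError:
        index[byte] = None  # byte undefined in this codec

for byte, char in index.items():
    if char is not None:
        print(f"0x{byte:02X} -> U+{ord(char):04X} {char!r}")
```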
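The character-reference fallback for form submission mentioned above can be sketched with Python's xmlcharrefreplace error handler, which emits the same numeric character references for code points the target encoding cannot represent; the sample string and target encoding are only illustrative.

```python
# Sketch: code points absent from the target encoding are serialized
# as numeric character references, mirroring what form submission does
# for unencodable characters.
payload = "snowman: \u2603".encode("windows-1252", errors="xmlcharrefreplace")
print(payload)  # b'snowman: &#9731;'
```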
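On the iso-8859-1/windows-1252 point, a small sketch (again leaning on Python's codecs, with U+20AC EURO SIGN as an arbitrary example) of why treating the two labels as the same encoding matters when encoding form data: windows-1252 has a byte for the euro sign where strict iso-8859-1 does not.

```python
# Sketch: on the web, pages and forms labeled iso-8859-1 are treated as
# windows-1252, so U+20AC EURO SIGN has a byte (0x80) to encode to.
euro = "\u20ac"
print(euro.encode("windows-1252"))  # b'\x80'

# A strict iso-8859-1 encoder has no mapping for it at all.
try:
    euro.encode("iso-8859-1")
except UnicodeEncodeError:
    print("no mapping in strict iso-8859-1")
```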
Received on Wednesday, 11 April 2012 07:25:02 UTC