Re: Sources for Encoding specification

Hi Anne,

Thanks for your reply - it's clear that a lot of work and thinking has already gone into this!

A few notes are inserted below. You'll probably hear more from us soon.

Best regards,
Norbert



On Apr 11, 2012, at 0:24, Anne van Kesteren wrote:

> Hi all, thanks for your email.
> 
> On Tue, 10 Apr 2012 20:52:20 +0200, Norbert Lindenberg <w3@norbertlindenberg.com> wrote:
>> I'm writing to you on behalf of the W3C Internationalization Core WG [1], which is currently looking at your Encoding specification [2].
>> 
>> We're wondering what sources you used to obtain the information in the specification, such as the list of encodings, their aliases, and the mapping tables for them. Is this derived from looking at the source code of one or more browsers, or from testing their behavior, or from a web index that shows which encodings are most commonly used on the web and how?
> 
> Encodings and labels are found by looking at IANA and reverse engineering browsers. http://wiki.whatwg.org/wiki/Web_Encodings has some of the early data. http://annevankesteren.nl/2010/12/encodings-labels-tested describes some of the early research for single-byte encodings.
> 
> The indexes are derived primarily by reverse engineering: by letting a browser decode all valid byte sequences for an encoding and then storing the result. For problematic indexes, such as the one for big5, additional research has been done on existing content, some of which is still ongoing. There has also been a ton of awesome feedback from the author of http://coq.no/character-tables/en, who knows a great deal about legacy encodings and their indexes.
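
Just to illustrate that approach for readers of this thread, here is a minimal sketch of the index-extraction idea, using Python's built-in codecs as a stand-in for a real browser's decoder (so the resulting values won't match the spec's tables exactly):

    # Sketch only: map each high byte of a single-byte encoding to the code
    # point a decoder produces for it. The actual research drove real
    # browsers; Python's cp1252 codec merely illustrates the shape of it.
    def build_single_byte_index(encoding):
        index = {}
        for byte in range(0x80, 0x100):
            try:
                index[byte] = ord(bytes([byte]).decode(encoding))
            except UnicodeDecodeError:
                index[byte] = None  # this decoder leaves the byte unmapped
        return index

    for byte, cp in sorted(build_single_byte_index("windows-1252").items()):
        print("0x%02X -> %s" % (byte, "U+%04X" % cp if cp is not None else "(error)"))
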
> 
> 
>> Some issues I'm wondering about, which could be resolved by looking at good data: Is EUC-TW not used on the web, or so rarely that it's not worth specifying?
> 
> From what I recall it is not supported by Chrome and therefore not considered.

A spec on encoding handling for the web should probably focus on those encodings that are most commonly used on the web. Mark Davis sometimes publishes data in that area; he may be able to provide more detail.
http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html
What browsers currently support may be influenced by which libraries they use, and the libraries may have accumulated encodings that aren't relevant to the web.

>> Do browsers really only encode the characters they decode? Don't they ever try to map full-width ASCII characters to their plain ASCII equivalents, or use other fallbacks, when encoding?
> 
> I have not tested encoders much yet; I've been focusing primarily on decoders. Testing encoders is on my todo list, and I guess I should write tests that iterate over all of Unicode and see how each code point gets encoded. I think ideally, though, encoders should just use the mapping tables. The most common case of encoding, HTML form submission, will encode missing code points as character references.
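
To make that last point concrete: the form-submission fallback turns code points the target encoding can't represent into numeric character references. Python's "xmlcharrefreplace" error handler happens to model the same behaviour, so a quick illustration (not browser code, just the shape of the fallback):

    # U+2603 SNOWMAN has no windows-1252 mapping, so it becomes &#9731;
    text = "caf\u00e9 \u2603"
    print(text.encode("windows-1252", errors="xmlcharrefreplace"))
    # b'caf\xe9 &#9731;'
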
> 
> 
>> Do they really assume it's safe to encode all windows-1252 characters for a form in a page labeled iso-8859-1?
> 
> It's required for compatibility with sites. The web platform is kind of illogical that way.
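
For readers unfamiliar with this quirk: the spec handles it by making labels such as "iso-8859-1" and "latin1" resolve to the windows-1252 encoding, so bytes 0x80-0x9F decode as Windows punctuation rather than C1 controls. A rough sketch of that label handling, with a deliberately tiny, hypothetical label table (the spec's real table is much larger):

    LABELS = {
        "iso-8859-1": "windows-1252",
        "latin1": "windows-1252",
        "windows-1252": "windows-1252",
    }

    def decode_as_a_browser_would(data, label):
        # The label the page claims is only a key into the table; the
        # encoding actually used for decoding (and for form encoding) is
        # whatever the table says, here always windows-1252.
        return data.decode(LABELS[label.strip().lower()])

    print(decode_as_a_browser_would(b"\x93quoted\x94", "iso-8859-1"))  # prints curly quotes
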
> 
> 
>> Do UTF-8 decoders really still allow for 5-byte and 6-byte sequences?
> 
> Is there any utf-8 specification that says otherwise? You get U+FFFD, but the sequences are definitely supported.

The UTF-8 specification (in the Unicode Standard, in ISO/IEC 10646, and in RFC 3629) was updated years ago to allow only sequences of up to four bytes. But I suppose it doesn't really matter in practice whether a five- or six-byte sequence is decoded and then mapped to U+FFFD because the result is above U+10FFFF, or whether it's treated as an error directly and replaced with U+FFFD...
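
As a small sanity check of that point, here is what a modern decoder does with a five-byte sequence. Decoders differ on whether the whole sequence becomes one U+FFFD or each byte becomes its own, but none of them produce a code point above U+10FFFF:

    # 0xF8 0x88 0x80 0x80 0x80 would have encoded U+200000 under the old,
    # pre-RFC 3629 definition of UTF-8; today it only yields replacement
    # characters (Python happens to emit one per offending byte).
    five_byte = b"\xf8\x88\x80\x80\x80"
    print(five_byte.decode("utf-8", errors="replace"))
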

> Kind regards,
> 
> 
>> [1] http://www.w3.org/International/track/actions/111
>> [2] http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html
> 
> 
> -- 
> Anne van Kesteren
> http://annevankesteren.nl/
> 
