- From: Henri Sivonen <hsivonen@hsivonen.fi>
- Date: Wed, 20 Nov 2013 15:34:24 +0200
- To: "Joe Hildebrand (jhildebr)" <jhildebr@cisco.com>
- Cc: Pete Cordell <petejson@codalogic.com>, Paul Hoffman <paul.hoffman@vpnc.org>, "www-tag@w3.org" <www-tag@w3.org>, JSON WG <json@ietf.org>
On Thu, Nov 14, 2013 at 4:59 PM, Joe Hildebrand (jhildebr) <jhildebr@cisco.com> wrote:
>> 00 00 -- --  UTF-32BE
>> 00 xx -- --  UTF-16BE
>> xx 00 00 00  UTF-32LE
>> xx 00 00 xx  UTF-16LE
>> xx 00 xx --  UTF-16LE
>> xx xx -- --  UTF-8
>
> +1 to this table. It's clear, correct, and implementable.

-1 ಠ_ಠ

As a person who has actually ended up (re)implementing many of the cases where Firefox sets up the conversion of bytes into characters, I find this kind of format-specific making stuff up in order to enable the use of BOMless UTF-16 or any sort of UTF-32 reprehensible.

There is no legitimate reason to use UTF-32 for interchange. UTF-32 as an interchange encoding only serves to increase development and QA cost and to potentially add security holes (all multibyte decoders implemented in C or C++ are fine opportunities for security holes). The W3C or the IETF should not act as if UTF-32 were a legitimate interchange encoding for any purpose for any format. Producers should be prohibited from emitting UTF-32. Consumers should be prohibited from supporting UTF-32.

As for UTF-16, there's no legitimate reason* for any new producer to use it for interchange, but consumers might (depending on format) need to implement it for compatibility with legacy producers, even though in retrospect UTF-16 is a bad idea in general and a terrible idea for interchange.

I don't know if JSON is a format where it's necessary to support the consumption of UTF-16 as an interchange encoding, but if it is, detection of in-band indications should happen in a way consistent with the "decode" algorithm in the Encoding Standard (http://encoding.spec.whatwg.org/#decode), for consistency with how the Web Platform handles text/html, text/plain and text/css (and, shocker, in practice XML). That is, there should be no special rules about looking for patterns of zeros.
However, the three BOMs, and only those three (UTF-8, big-endian UTF-16, little-endian UTF-16), should take precedence over everything else, including out-of-band metadata.

* gzip takes away the advantage UTF-16 might have over UTF-8 for East Asian-heavy text.

-- 
Henri Sivonen
hsivonen@hsivonen.fi
http://hsivonen.fi/
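
For illustration only, the BOM precedence described in the message could be sketched roughly as below. This is not the Encoding Standard's "decode" algorithm verbatim, just a minimal approximation of the rule that the three BOMs win over everything else and that no zero-byte patterns are inspected. The function name sniff_and_decode and the fallback parameter (standing in for out-of-band metadata such as a charset parameter) are hypothetical, and Python codec names are assumed in place of Encoding Standard labels.

```python
import codecs
from typing import Tuple

# The only three BOMs that are honored. UTF-32 BOMs are deliberately absent,
# so FF FE 00 00 is simply treated as a UTF-16LE BOM followed by U+0000.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)


def sniff_and_decode(data: bytes, fallback: str = "utf-8") -> Tuple[str, str]:
    """Return (encoding used, decoded text).

    A leading BOM overrides `fallback` (hypothetical stand-in for
    out-of-band metadata). Without a BOM, `fallback` is used as-is;
    there is no guessing based on patterns of zero bytes.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Strip the BOM and decode, replacing malformed sequences.
            return encoding, data[len(bom):].decode(encoding, errors="replace")
    return fallback, data.decode(fallback, errors="replace")
```

With this sketch, sniff_and_decode(b'\xfe\xff\x00{\x00}', fallback="latin-1") picks big-endian UTF-16 despite the out-of-band label, while BOMless UTF-16 input is decoded with the fallback and is not detected.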
Received on Wednesday, 20 November 2013 13:34:55 UTC