Re: Encoding detection (Was: Re: [Json] JSON: remove gap between Ecma-404 and IETF draft) from Henri Sivonen on 2013-11-20 (www-tag@w3.org from November 2013)

From: Henri Sivonen <hsivonen@hsivonen.fi>
Date: Wed, 20 Nov 2013 15:34:24 +0200
To: "Joe Hildebrand (jhildebr)" <jhildebr@cisco.com>
Cc: Pete Cordell <petejson@codalogic.com>, Paul Hoffman <paul.hoffman@vpnc.org>, "www-tag@w3.org" <www-tag@w3.org>, JSON WG <json@ietf.org>
Message-ID: <CANXqsRJEtBoprQFrftz80ZigmBR_NHoEXK1sR4GyBtz5B2KC8Q@mail.gmail.com>

On Thu, Nov 14, 2013 at 4:59 PM, Joe Hildebrand (jhildebr)
<jhildebr@cisco.com> wrote:
>>   00 00 -- --  UTF-32BE
>>   00 xx -- --  UTF-16BE
>>   xx 00 00 00  UTF-32LE
>>   xx 00 00 xx  UTF-16LE
>>   xx 00 xx --  UTF-16LE
>>   xx xx -- --  UTF-8
>
> +1 to this table.  It's clear, correct, and implementable.

-1 ಠ_ಠ

As a person who has actually ended up (re)implementing many of the
cases where Firefox sets up the conversion of bytes into characters, I
find this kind of format-specific invention, made up in order to enable
the use of BOMless UTF-16 or any sort of UTF-32, reprehensible.

There is no legitimate reason to use UTF-32 for interchange. UTF-32 as
an interchange encoding only serves to increase development and QA
cost and to potentially add security holes (all multibyte decoders
implemented in C or C++ are fine opportunities for security holes).
The W3C or the IETF should not act as if UTF-32 were a legitimate
interchange encoding for any purpose or any format. Producers should
be prohibited from emitting UTF-32. Consumers should be prohibited
from supporting UTF-32.

As for UTF-16, there's no legitimate reason* for any new producer to
use it for interchange, but consumers might (depending on format)
need to implement it for compatibility with legacy producers, even
though in retrospect UTF-16 is a bad idea in general and a terrible
idea for interchange. I don't know if JSON is a format where it's
necessary to support the consumption of UTF-16 as an interchange
encoding, but if it is, detection based on in-band indications should
be consistent with the "decode" algorithm in the Encoding Standard
(http://encoding.spec.whatwg.org/#decode), matching how the Web
Platform handles text/html, text/plain and text/css (and, shocker,
in practice XML). That is, there should be no special rules about
looking for patterns of zeros. However, the three BOMs (UTF-8,
big-endian UTF-16, little-endian UTF-16), and only those three, should
take precedence over everything else, including out-of-band metadata.
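
To make the BOM-first rule concrete, here is a minimal Python sketch of
that kind of detection. It is an illustration under stated assumptions,
not the Encoding Standard's normative algorithm: the names sniff_bom and
decode_with_bom and the "fallback" parameter (standing in for whatever
out-of-band metadata or format default applies) are hypothetical, and
Python's errors="replace" only approximates the standard's error handling.

```python
import codecs

# The three BOMs, and only these three, are recognised.
# There is deliberately no UTF-32 entry and no zero-byte pattern matching.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)


def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, fallback: str = "utf-8") -> str:
    """Decode data; a BOM overrides the fallback (out-of-band) encoding."""
    encoding, skip = sniff_bom(data)
    if encoding is None:
        encoding = fallback  # hypothetical stand-in for out-of-band metadata
    # Strip the BOM, if any, before decoding.
    return data[skip:].decode(encoding, errors="replace")
```

With this sketch, input that starts with FF FE is decoded as little-endian
UTF-16 even when the fallback says otherwise, while BOMless input that
merely contains zero bytes is decoded with the fallback and no guessing.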

* gzip takes away the advantage UTF-16 might have over UTF-8 for East
Asian-heavy text.
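
One way to check the footnote's claim against your own data is to compare
raw and gzip-compressed sizes in both encodings. The Python sketch below
only shows how to take the measurement; the helper name encoded_sizes and
the sample string are placeholders, and a two-sentence sample is far too
short to be representative because gzip's fixed overhead dominates. Use
realistic documents to draw conclusions.

```python
import gzip


def encoded_sizes(text: str) -> dict:
    """Map encoding name -> (raw byte length, gzip-compressed byte length)."""
    result = {}
    for name in ("utf-8", "utf-16-le"):
        raw = text.encode(name)
        result[name] = (len(raw), len(gzip.compress(raw)))
    return result


# Placeholder sample; substitute a real East Asian-heavy document.
sample = "吾輩は猫である。名前はまだ無い。"

for name, (raw_len, gz_len) in encoded_sizes(sample).items():
    print(f"{name}: raw={raw_len} bytes, gzip={gz_len} bytes")
```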
-- 
Henri Sivonen
hsivonen@hsivonen.fi
http://hsivonen.fi/

Received on Wednesday, 20 November 2013 13:34:55 UTC