Re: [Json] Encoding detection (Was: Re: JSON: remove gap between Ecma-404 and IETF draft)

On Tue, Nov 26, 2013 at 5:27 PM, Nico Williams <nico@cryptonector.com> wrote:
> I can't think of any either.  UTF-32 is superficially appealing (O(1)
> indexing!) but it's only O(1) indexing by codepoint counts, not
> character counts so it's still lame and you pay for longer strings.

It doesn't matter for the interchange encoding, which is a different
issue from in-RAM string representation (where indeed O(1) indexing by
code point is overrated but appeals to people who have heard about
astral characters but haven't yet considered that, due to
combining characters, they can't treat code units independently
of the adjacent code units anyway).

> No, I think this is too much.  If someone wants to use UTF-32 because
> they have numbers showing that for IPC and local processing it's faster,
> that might be compelling; let them.

If someone wants to use UTF-32 for local shared-memory IPC, the
communicating parties are tightly coupled and don't need a standard
to authorize whatever they are doing. Standards are for
loosely-coupled communication where there aren't bilateral
arrangements between the communicating parties. (Also, UTF-32 doesn't
make sense even for local IPC if it happens over a socket rather than
over shared memory.)

> Anyways, I think we're focusing too hard on details that aren't terribly
> important.  The "non-BOM-based sniffing rules" work and can be derived
> by any capable implementor whether stated or not by the RFC.

Has anyone tested whether a substantial proportion of the existing
implementations already support the non-BOM-based sniffing rules? That
is, can the rules be relied on with existing implementations?
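(For reference, the sniffing rules in question work because the first
two characters of a JSON text are guaranteed to be ASCII, so the
pattern of zero bytes among the first four bytes identifies the UTF.
A minimal Python sketch of the RFC 4627, section 3 rules; the
function name is mine, and it assumes BOMless input:)

```python
def sniff_json_encoding(data: bytes) -> str:
    """Guess the encoding of a BOMless JSON text from the pattern of
    zero bytes among its first four bytes (per RFC 4627, section 3,
    which assumes the text starts with two ASCII characters)."""
    b = data[:4]
    if len(b) < 4:
        # Too short to apply the four-byte rules; assume UTF-8.
        return "utf-8"
    if b[0] == 0 and b[1] == 0 and b[2] == 0 and b[3] != 0:
        return "utf-32-be"  # 00 00 00 xx
    if b[0] != 0 and b[1] == 0 and b[2] == 0 and b[3] == 0:
        return "utf-32-le"  # xx 00 00 00
    if b[0] == 0 and b[1] != 0 and b[2] == 0:
        return "utf-16-be"  # 00 xx 00 xx
    if b[0] != 0 and b[1] == 0 and b[3] == 0:
        return "utf-16-le"  # xx 00 xx 00
    return "utf-8"          # xx xx xx xx
```

(Simple enough to derive, as Nico says; the open question above is
whether deployed parsers actually implement it.)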

>> I continue to strongly disapprove of non-BOM-based sniffing rules
>> unless there's compelling evidence that such rules are needed in order
>> to interoperate with bogus existing serializers.
>
> I think it's fair to object to requiring sniffing, and I support not
> requiring it.  I don't see anything wrong with leaving those in for
> those who want to include support for it.

Optional features are a trap. They are used to get the appearance of
consensus when people who are opposed to misfeatures are told that
they don't need to implement them. However, when some implementations
start emitting syntax that exercises optional features (or someone
writes test cases just to smugly point out the lack of support for
optional features), everyone ends up having to implement optional
features in order to be compatible.

On Tue, Nov 26, 2013 at 6:01 PM, John Cowan <cowan@mercury.ccil.org> wrote:
> Henri Sivonen scripsit:
>
>> What sensible reasons could there possibly be?
>
> The fact that you (or even I) can't think of them doesn't mean they
> don't exist.

You and I have a pretty good idea of Unicode stuff. If we can't think
of sensible reasons and others in the discussion haven't seen UTF-32
(or UTF-16) JSON in the wild, either, we should have enough confidence
to say that UTF-8 is enough and not suggest that future implementors
waste implementation and QA effort on UTFs that don't make sense for
interchange.

The notion that in theory, maybe, someone might use a non-UTF-8
encoding for something, even if there's no data to support such
conjecture, is the source of a lot of harmful inertia around character
encodings.

> There would be no sensible reason for me to write to you in
> Finnish: your command of written English is near-native, and my command
> of Finnish is zero.  But it would be absurd of me to say that people
> should not communicate in Finnish because it harms interoperability.
> It so happens that I know that there are five million people cheerfully
> writing to each other in Finnish, under the impression that it is allowed.
> But even if I didn't happen to know that, the point would be the same.
>
> Now Finnish is a natural language, and JSON is not: it exists only by
> virtue of its definition.  But that definition explains how to communicate
> in JSON, and anyone who adheres to it is communicating correctly.  For us
> to chop their feet out from under them by saying that what they are doing
> does not count as JSON would be just as arbitrary as banning Finnish
> because almost nobody speaks it.  We can say that we think it's a bad idea
> to use non-UTF-8 encodings in JSON, and that's as far as we can justly go.

This is a bad analogy for several reasons:

 * Banning a particular layout of bits among equally expressive
alternatives is not the same thing as banning someone's native natural
language.

 * Finnish has uses other than test cases for the sake of test cases
or XSS exploits.

 * Communication in Finnish can actually be shown to happen in practice.

 * It would be unreasonable to suggest that everyone ought to support
the receipt of Finnish communication in addition to supporting the
receipt of English communication.

 * Accepting your analogy leads not only to supporting UTF-32 but also
to supporting various non-UTF encodings, which would be even worse
than suggesting that UTF-16 and UTF-32 be supported in addition to
UTF-8.

On Fri, Nov 22, 2013 at 6:39 PM, Tim Bray <tbray@textuality.com> wrote:
> I’ve been using JSON for quite a few years, but hardly ever in either a
> to-browser or from-browser role; what I care about is mostly its use in
> RESTful APIs generally and identity APIs specifically.  In those scenarios,
> it would be seen as wildly inappropriate to use anything but UTF-8; I’ve
> never actually seen anything else.  In practice, it would be very unlikely
> for anyone to deploy UTF-16 or any other non-UTF-8 flavor in a non-browser
> scenario.

Right.

> Having said that, I’m still, hundreds of messages later, not 100% sure what
> our draft should say about BOMs :(

Outside the context of JSON specifically, it is unusual to use
BOMless UTF-16 (unusual even assuming the use of UTF-16, which is
itself unusual), and in other textual formats BOMless UTF-16 is a
pain. It seems to me that if support for UTF-16 (or UTF-32) is
retained, it would be good to test a bunch of existing implementations
(the ones that section 11 of the draft mentions) to see how they
behave with BOMful and BOMless input.

Of course, such testing takes work, so banning UTF-16 and UTF-32 would
avoid such work. :-)
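(For comparison, BOM-based detection is just a prefix match against
the five known BOM sequences; a sketch in Python, names mine. The
only subtlety is checking the UTF-32 BOMs first, since the UTF-32LE
BOM begins with the UTF-16LE BOM:)

```python
# Known BOM byte sequences, longest first so a UTF-32LE BOM
# (FF FE 00 00) is not mistaken for a UTF-16LE BOM (FF FE).
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xef\xbb\xbf", "utf-8"),
]

def detect_bom(data: bytes):
    """Return (encoding, bom_length); (\"utf-8\", 0) if no BOM."""
    for bom, enc in BOMS:
        if data.startswith(bom):
            return enc, len(bom)
    return "utf-8", 0
```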

-- 
Henri Sivonen
hsivonen@hsivonen.fi
http://hsivonen.fi/

Received on Wednesday, 27 November 2013 13:14:45 UTC