Re: [Json] Encoding detection (Was: Re: JSON: remove gap between Ecma-404 and IETF draft) from Bjoern Hoehrmann on 2013-11-27 (www-tag@w3.org from November 2013)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Wed, 27 Nov 2013 01:14:00 +0100
To: Nico Williams <nico@cryptonector.com>
Cc: JSON WG <json@ietf.org>, www-tag <www-tag@w3.org>, es-discuss <es-discuss@mozilla.org>
Message-ID: <fbba99l1gugiq5idqm7h5afk1o0p5j3teq@hive.bjoern.hoehrmann.de>

* Nico Williams wrote:
>On Tue, Nov 26, 2013 at 09:15:38PM +0100, Bjoern Hoehrmann wrote:
>> * Nico Williams wrote:
>> >We must not require encoding detection functionality in parsers.  We
>> >must not forbid it either.  We might need to say that encodings other
>> >than UTF-8/16/32 may not be reliably detected, therefore they are highly
>> >discouraged, even forbidden except where protocols specifically call for
>> >them.
>> 
>> When I pass a fully conforming UTF-8 encoded application/json entity to
>> a fully conforming JSON parser I do not want the parser to do something
>> funny like interpreting the document as if it were Windows-1252 encoded.
>> I am amazed how many people here think a parser that does that should
>> not be considered broken.
>
>You missed the point.

"We must require encoding detection functionality in parsers. We must
forbid encoding detection functionality beyond that. We must say that
encodings other than UTF-8/16/32 are forbidden in any and all cases."
is how I would modify what you said above (with some caveats).

Note that I am talking about labeled sequences of octets,
application/json entities, not paintings on a cave wall that look
similar to JSON text in a strange font. In a labeled sequence of
octets I can tell for sure whether there are invisible characters in
it if I know the encoding.

There are two forms to consider. One is the labeled sequence of octets
that we call "application/json entity". The other is a sequence of
Unicode scalar values. That is the alphabet of the ABNF grammar in the
specification. If you have anything else, then the specification does
not apply to your situation.

>If you wanted to forbid non-Unicode, non-UTF encodings, then you'd be
>preventing such a shell, and for what reason?  If you only mean that
>auto-detection of encoding should not even be mentioned, I'm fine with
>that, and I've already said so earlier.

Above I said that there are two forms to consider. Encoding detection
is what allows us to convert the "application/json entity" form into
the "sequence of Unicode scalar values" form. We need the latter form
in order to apply the ABNF grammar. Imagine you receive this:

  HTTP/1.1 200 OK
  Content-Type: application/json
  ...

  ABCD...

There would be at least two specifications that apply here, the HTTP
and the application/json specifications. Would you like them to say
that you are on your own, that "ABCD..." could mean anything? I would
like them to say that "ABCD..." is an array containing the integer
zero three times, like `[0,0,0]`. I can build robust software based on
that.

I cannot build robust software based on "well, maybe it's EBCDIC?
Have you tried GB 18030? UTF-7 might be worth a try otherwise. Are
you sure this matters at all?"
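
To make the bounded kind of detection argued for above concrete, here
is a minimal sketch in Python. It is not taken from this message: it
assumes the zero-octet heuristic from RFC 4627 section 3 (the first
two characters of a JSON text are ASCII, so the pattern of zero octets
in the first four octets distinguishes UTF-8, UTF-16 and UTF-32), it
assumes no byte order mark, and the function names are hypothetical.

```python
import json


def detect_json_encoding(octets: bytes) -> str:
    """Pick UTF-8, UTF-16 or UTF-32 from the first four octets.

    Assumption: the RFC 4627 section 3 heuristic, which relies on the
    first two characters of the JSON text being ASCII and on there
    being no byte order mark. Nothing outside the UTF family is ever
    returned.
    """
    if len(octets) >= 4:
        zeros = tuple(b == 0 for b in octets[:4])
        patterns = {
            (True, True, True, False): "utf-32-be",   # 00 00 00 xx
            (False, True, True, True): "utf-32-le",   # xx 00 00 00
            (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
            (False, True, False, True): "utf-16-le",  # xx 00 xx 00
        }
        if zeros in patterns:
            return patterns[zeros]
    # Anything else, including inputs shorter than four octets,
    # is treated as UTF-8.
    return "utf-8"


def decode_json_entity(octets: bytes):
    """Turn the octets of an application/json entity into a parsed value.

    Decoding is strict: octets that are not valid in the detected
    encoding raise UnicodeDecodeError instead of being reinterpreted
    in some other encoding such as Windows-1252.
    """
    text = octets.decode(detect_json_encoding(octets))
    return json.loads(text)
```

With this sketch, the octets of `[0,0,0]` in any of the five supported
encoding forms parse to the same array, while octets that are only
valid in Windows-1252 are rejected rather than guessed at.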
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Received on Wednesday, 27 November 2013 00:14:28 UTC