Re: New full Unicode for ES6 idea

On Feb 19, 2012, at 1:34 PM, Brendan Eich wrote:

> Wes Garland wrote:
>> Is there a proposal for interaction with JSON?
> 
> From http://www.ietf.org/rfc/rfc4627, 2.5:
> 
>   To escape an extended character that is not in the Basic Multilingual
>   Plane, the character is represented as a twelve-character sequence,
>   encoding the UTF-16 surrogate pair.  So, for example, a string
>   containing only the G clef character (U+1D11E) may be represented as
>   "\uD834\uDD1E".

I think it is actually more complex than just the above.  2.5 also says:

"All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F)." (emphasis added)

and 3. says:

"JSON text SHALL be encoded in Unicode.  The default encoding is UTF-8." and then goes on to talk about how to detect UTF-8, 16, and 32 LE and BE encodings.  So all those are legal.

It is presumably up to a JSON parser to decide how non-BMP characters in strings are encoded for whatever internal representation it is targeting. Currently JS JSON.parse takes its input from a JavaScript string that is composed of 16-bit UCS-2 elements, so there are no unencoded non-BMP characters in the string. However, according to the ES5.1 spec, JSON.parse (and JSON.stringify) will just pass through any UTF-16 surrogate pairs that are encountered.
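
As a rough illustration of the current (no-BRS) behavior, here is a minimal sketch, assuming an engine whose strings are sequences of 16-bit elements. The variable names are made up for the example. It parses the G clef from RFC 4627 both in its twelve-character escaped form and as a raw surrogate pair, and both should produce the same two-element string:

```javascript
// JSON text using the twelve-character escape from RFC 4627, section 2.5.
// The doubled backslashes keep the escapes literal in the JSON text itself.
const escapedText = '"\\uD834\\uDD1E"';

// JSON text that contains the surrogate pair directly as two string elements.
const rawText = '"\uD834\uDD1E"';

const fromEscaped = JSON.parse(escapedText);
const fromRaw = JSON.parse(rawText);

// Both results hold the same two 16-bit elements, 0xD834 followed by 0xDD1E.
console.log(fromEscaped.length);                      // 2
console.log(fromEscaped.charCodeAt(0).toString(16));  // d834
console.log(fromEscaped.charCodeAt(1).toString(16));  // dd1e
console.log(fromEscaped === fromRaw);                 // true
```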

With the BRS, JSON.parse and JSON.stringify could encounter non-BMP characters in the JS string they are processing, and those also would presumably pass through transparently. The one requirement of RFC 4627 that would be impacted by the BRS would be the 12-character escape sequences mentioned above. Currently JSON.parse implementations encode those as UTF-16 surrogate pairs in the generated strings. If the BRS is flipped, the RFC seems to require that they generate a single string element. Because the JSON.stringify spec does not escape anything other than the quotation mark, reverse solidus, and control characters, any non-BMP characters it encounters would pass through unencoded. This implies that JSON.parse input of the form "\uD834\uDD1E" would probably round trip back out via JSON.stringify as a JSON string containing the single unencoded G clef character. Logically equivalent, but not the identical JSON text.
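
The round trip can already be observed without the BRS. The sketch below uses made-up variable names and assumes an engine with 16-bit string elements: the escaped input comes back out of JSON.stringify unescaped, so the JSON text changes while the value does not. With the BRS flipped, the only expected difference is that the parsed string would presumably have one element rather than two.

```javascript
// Input JSON text: the G clef written as a twelve-character escape sequence.
const inputText = '"\\uD834\\uDD1E"';

// Parse, then serialize again.
const value = JSON.parse(inputText);
const outputText = JSON.stringify(value);

// JSON.stringify does not re-escape the pair, so the text differs...
console.log(outputText === inputText);          // false
console.log(inputText.length);                  // 14
console.log(outputText.length);                 // 4 without the BRS (quote, two elements, quote)

// ...but both texts denote the same string value.
console.log(JSON.parse(outputText) === value);  // true
```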

Allen

Received on Monday, 20 February 2012 00:25:06 UTC