Encoding detection (Was: Re: [Json] JSON: remove gap between Ecma-404 and IETF draft)

Original Message From: "Joe Hildebrand" <hildjj@cursive.net>

> On 11/13/13 2:27 PM, "Paul Hoffman" <paul.hoffman@vpnc.org> wrote:
>
>><no hat>
>>
>>On Nov 13, 2013, at 12:24 PM, Joe Hildebrand (jhildebr)
>><jhildebr@cisco.com> wrote:
>>
>>> We would also need to change section 8.1 according to the mechanism that
>>> was previously proposed:
>>>
>>>    00 00 00 xx  UTF-32BE
>>>    00 xx ?? xx  UTF-16BE
>>>    xx 00 00 00  UTF-32LE
>>>    xx 00 xx ??  UTF-16LE
>>>    xx xx ?? ??  UTF-8
>>>
>>>
>>> in order to account for strings at the top level whose first character
>>> has a codepoint greater than 127.
>>
>>A string at the top level of a JSON text still needs to start with an
>>ASCII " character, so the logic is still fine, I believe.
>
>
> Without top level strings, the first *two* characters of any JSON text are
> always ASCII.  This:
>
>
> "Ā"  (that's U+0022 U+0100 U+0022)
>
> ...
>
> So the JSON text above would not match any of the table entries, causing
> an error.


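In http://www.ietf.org/mail-archive/web/json/current/msg00565.html I
mentioned that we also need to allow for characters such as U+2C00 to be the
first character in a quoted string. In UTF-16LE, a text starting with
U+0022 U+2C00 begins with the bytes 22 00 00 2C, which matches neither the
UTF-16LE nor the UTF-32LE entry in the table quoted above.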

This requires a pattern like:

    xx 00 00 xx  UTF-16LE

giving:

   00 00 00 xx  UTF-32BE
   00 xx 00 xx  UTF-16BE
   00 xx xx 00  UTF-16BE
   00 xx xx xx  UTF-16BE
   xx 00 00 00  UTF-32LE
   xx 00 00 xx  UTF-16LE
   xx 00 xx 00  UTF-16LE
   xx 00 xx xx  UTF-16LE
   xx xx xx xx  UTF-8

That can be reduced a bit if we use "--" to indicate "not-tested":

   00 00 -- --  UTF-32BE
   00 xx -- --  UTF-16BE
   xx 00 00 00  UTF-32LE
   xx 00 00 xx  UTF-16LE
   xx 00 xx --  UTF-16LE
   xx xx -- --  UTF-8
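
To make the reduced table concrete, here is a minimal Python sketch of how
it might be applied. The function name detect_json_encoding is hypothetical,
and the sketch assumes that "xx" means any non-zero byte, that the input has
no byte order mark, and that at least four bytes are available (shorter JSON
texts are outside what the table covers). It is an illustration of the
table, not proposed text for the draft.

```python
def detect_json_encoding(data: bytes) -> str:
    """Guess the encoding of a JSON text from its first four bytes.

    Implements the reduced table: 00 is a zero byte, xx is any non-zero
    byte (an assumption of this sketch), and "--" bytes are not tested.
    """
    if len(data) < 4:
        # The table needs four bytes; shorter inputs are not handled here.
        raise ValueError("need at least four bytes to apply the table")

    b0, b1, b2, b3 = data[:4]

    if b0 == 0:
        # 00 00 -- --  UTF-32BE
        # 00 xx -- --  UTF-16BE
        return "utf-32-be" if b1 == 0 else "utf-16-be"

    if b1 != 0:
        # xx xx -- --  UTF-8
        return "utf-8"

    if b2 != 0:
        # xx 00 xx --  UTF-16LE
        return "utf-16-le"

    # xx 00 00 00  UTF-32LE
    # xx 00 00 xx  UTF-16LE
    return "utf-32-le" if b3 == 0 else "utf-16-le"


if __name__ == "__main__":
    # Top-level strings starting with U+0100 and U+2C00, plus an object.
    samples = ['"\u0100"', '"\u2c00"', '{"a":1}']
    encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]
    for text in samples:
        for enc in encodings:
            guessed = detect_json_encoding(text.encode(enc))
            print(f"{text!r:12} {enc:10} -> {guessed}")
```

Run as a script, it encodes the two problem strings from this thread in each
of the five encodings and prints what the table guesses for each.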


Pete Cordell
Codalogic Ltd
C++ tools for C++ programmers, http://codalogic.com
Read & write XML in C++, http://www.xml2cpp.com

Received on Friday, 15 November 2013 11:52:23 UTC