Re: Encoding detection from Pete Cordell on 2013-11-17 (www-tag@w3.org from November 2013)

From: Pete Cordell <petexmldev@codalogic.com>
Date: Sun, 17 Nov 2013 10:01:16 -0000
To: "Joe Hildebrand (jhildebr)" <jhildebr@cisco.com>, "Henry S. Thompson" <ht@inf.ed.ac.uk>
Cc: <www-tag@w3.org>, "JSON WG" <json@ietf.org>
Message-ID: <361AD700904D43B589DECD2959FBC5CF@codalogic>
There's some debate about whether BOMs are allowed (at least officially).  I 
think I would accommodate them if I were to implement a JSON parser.

If I did handle BOMs I think I would adopt the "if it starts with a BOM..." 
approach similar to the Python method, and only resort to deduction based on 
assumptions about ASCII characters if no BOM was found.  (I think BOM 
presence can be surmised by only looking at the first byte.  If it's greater 
than 0x80 - specifically if it's 0xef, 0xfe, 0xff - then it's an error if 
you don't go on to find a BOM.)
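
As a rough illustration of the BOM-first check, here is a minimal Python
sketch.  The function name sniff_bom and the returned labels are made up for
this example; only the BOM constants come from the standard codecs module.
Note that the UTF-32BE BOM begins with 00, so the sketch tests for each BOM
explicitly and uses the first byte only to decide when a missing BOM is an
error.

```python
import codecs

# BOMs to look for, longest first, so that UTF-32LE (FF FE 00 00) is not
# mistaken for UTF-16LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when there is no BOM.

    Raises ValueError when the first byte can only begin a BOM
    (0xEF, 0xFE or 0xFF) but no complete BOM follows.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    if data[:1] in (b"\xef", b"\xfe", b"\xff"):
        raise ValueError("first byte suggests a BOM, but none was found")
    return None, 0
```

A caller would strip bom_length bytes before decoding, and fall back to
deduction from the ASCII characters only when the result is (None, 0).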

While I'm here, Joe mentioned "implementable".  From an implementer's 
perspective the table I presented earlier might be better presented in the 
following order:

   00 00 -- --  UTF-32BE
   00 xx -- --  UTF-16BE
   xx xx -- --  UTF-8
   xx 00 xx --  UTF-16LE
   xx 00 00 xx  UTF-16LE
   xx 00 00 00  UTF-32LE

Or is that taking the fun out of it for the implementer?!
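
Read top to bottom, the reordered table is a small decision tree.  A minimal
Python sketch of it follows, assuming no BOM is present.  The function name
sniff_no_bom is hypothetical, and the handling of inputs shorter than four
bytes (missing bytes are treated as non-zero) is my own assumption, not
something the table specifies.

```python
def sniff_no_bom(data: bytes) -> str:
    """Guess the encoding of BOM-less JSON from its first four bytes."""
    # Assumption (not in the table): pad short inputs with a non-zero
    # byte so that, for example, b"{}" is still classified as UTF-8.
    b = data[:4].ljust(4, b"\x01")
    if b[0] == 0:
        # 00 00 -- --  UTF-32BE,  00 xx -- --  UTF-16BE
        return "utf-32-be" if b[1] == 0 else "utf-16-be"
    if b[1] != 0:
        # xx xx -- --  UTF-8
        return "utf-8"
    if b[2] != 0:
        # xx 00 xx --  UTF-16LE
        return "utf-16-le"
    # xx 00 00 xx  UTF-16LE,  xx 00 00 00  UTF-32LE
    return "utf-16-le" if b[3] != 0 else "utf-32-le"
```

Each byte is examined at most once, which is the point of the reordering.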

Pete Cordell
Codalogic Ltd
C++ tools for C++ programmers, http://codalogic.com
Read & write XML in C++, http://www.xml2cpp.com
----- Original Message ----- 
From: "Henry S. Thompson" <ht@inf.ed.ac.uk>
To: "Joe Hildebrand (jhildebr)" <jhildebr@cisco.com>
Cc: "Pete Cordell" <petejson@codalogic.com>; "Paul Hoffman" 
<paul.hoffman@vpnc.org>; <www-tag@w3.org>; "JSON WG" <json@ietf.org>
Sent: Thursday, November 14, 2013 5:17 PM
Subject: Re: Encoding detection


Joe Hildebrand (jhildebr) writes:

> On 11/14/13 5:04 AM, "Pete Cordell" <petejson@codalogic.com> wrote:
>
>> In http://www.ietf.org/mail-archive/web/json/current/msg00565.html I
>> mentioned that we also need to allow for characters such as U+2c00 to be
>> the first character in a quoted string.
>
> Ah, yes.  Sorry, I quoted from the wrong part of the conversation.  I
> completely agree.
>
>>That can be reduced a bit if we use "--" to indicate "not-tested":
>>
>>   00 00 -- --  UTF-32BE
>>   00 xx -- --  UTF-16BE
>>   xx 00 00 00  UTF-32LE
>>   xx 00 00 xx  UTF-16LE
>>   xx 00 xx --  UTF-16LE
>>   xx xx -- --  UTF-8
>
> +1 to this table.  It's clear, correct, and implementable.

Doesn't work if you want to allow a pre-document BOM.  But I think the
following enlarged grid does work, where "--" now means "not any
value(s) explicitly tested for with an 'earlier' shared prefix":

00 00 FE FF  UTF-32 (BE) w. BOM
[00 00 FF FE  UCS-4 w. BOM, unusual octet order (2143)]
00 00 -- --  UTF-32BE
00 xx -- --  UTF-16BE
FF FE 00 00  UTF-32 (LE) w. BOM
[FE FF 00 00  UCS-4 w. BOM, unusual octet order (3412)]
FE FF -- --  UTF-16 (BE) w. BOM
FF FE -- --  UTF-16 (LE) w. BOM
xx 00 00 00  UTF-32LE
xx 00 00 xx  UTF-16LE [Note that the following algorithm disagrees here]
xx 00 xx --  UTF-16LE
EF BB BF --  UTF-8 w. BOM
xx xx -- --  UTF-8
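
One way to read this grid is as an ordered list of patterns where the first
matching row wins.  The Python sketch below does exactly that, treating "xx"
as "any non-zero byte" and "--" as "not tested".  The names GRID and classify
and the wording of the labels are mine, and requiring four bytes of input is
an assumption made to keep the example short.

```python
# Rows in the same order as the grid; the first match wins.
GRID = [
    ("00 00 FE FF", "UTF-32BE with BOM"),
    ("00 00 FF FE", "UCS-4 with BOM, octet order 2143"),
    ("00 00 -- --", "UTF-32BE"),
    ("00 xx -- --", "UTF-16BE"),
    ("FF FE 00 00", "UTF-32LE with BOM"),
    ("FE FF 00 00", "UCS-4 with BOM, octet order 3412"),
    ("FE FF -- --", "UTF-16BE with BOM"),
    ("FF FE -- --", "UTF-16LE with BOM"),
    ("xx 00 00 00", "UTF-32LE"),
    ("xx 00 00 xx", "UTF-16LE"),
    ("xx 00 xx --", "UTF-16LE"),
    ("EF BB BF --", "UTF-8 with BOM"),
    ("xx xx -- --", "UTF-8"),
]


def _matches(pattern: str, sample: bytes) -> bool:
    for token, byte in zip(pattern.split(), sample):
        if token == "--":
            continue  # not tested
        if token == "xx":
            if byte == 0:
                return False  # "xx" means any non-zero byte
        elif byte != int(token, 16):
            return False
    return True


def classify(data: bytes) -> str:
    sample = data[:4]
    if len(sample) < 4:
        # Assumption: the grid is only applied to four-byte samples.
        raise ValueError("need at least four bytes")
    for pattern, label in GRID:
        if _matches(pattern, sample):
            return label
    raise ValueError("no row matched")
```

Because the rows are tried in order, the BOM rows must stay ahead of the
"xx" rows that would otherwise swallow them.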

A possibly simpler algorithm, which has the same outcome I think,
minus the unusual octet cases and the exception noted above, is used
by the Python requests package [1] for JSON charset detection.
Schematically, this works as follows:

FF FE 00 00  UTF-32 (LE) w. BOM
00 00 FE FF  UTF-32 (BE) w. BOM
EF BB BF     UTF-8 w. BOM
FE FF        UTF-16 (BE) w. BOM
FF FE        UTF-16 (LE) w. BOM
Now count 00 bytes in first 4:
 0           UTF-8
 2
  00 -- 00 --   UTF-16 (BE)
  -- 00 -- 00   UTF-16 (LE)
 3
  00 00 00 --   UTF-32 (BE)
  -- 00 00 00   UTF-32 (LE)
 error
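
For concreteness, the schematic above can be written out in Python roughly
as follows.  This is a sketch of the schematic as given in this message, not
a copy of the requests source; the function name guess_json_encoding is
hypothetical, and returning Python codec names (with None for the error
case) is my own choice.

```python
import codecs

_NUL = b"\x00"


def guess_json_encoding(data: bytes):
    """Return a Python codec name for JSON bytes, or None on error."""
    sample = data[:4]
    # BOM checks first, UTF-32 before UTF-16 because FF FE is a
    # prefix of FF FE 00 00.
    if sample in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE):
        return "utf-32"      # this codec consumes the BOM when decoding
    if sample[:3] == codecs.BOM_UTF8:
        return "utf-8-sig"   # likewise
    if sample[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE):
        return "utf-16"      # likewise
    # No BOM: count the 00 bytes in the first four.
    nulls = sample.count(_NUL)
    if nulls == 0:
        return "utf-8"
    if nulls == 2:
        if sample[0::2] == _NUL * 2:   # 00 -- 00 --
            return "utf-16-be"
        if sample[1::2] == _NUL * 2:   # -- 00 -- 00
            return "utf-16-le"
    if nulls == 3:
        if sample[:3] == _NUL * 3:     # 00 00 00 --
            return "utf-32-be"
        if sample[1:] == _NUL * 3:     # -- 00 00 00
            return "utf-32-le"
    return None
```

With this sketch, a UTF-16LE document that opens with a quote followed by
U+0400 (bytes 22 00 00 04) falls through to None.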

I _think_ requests is only correct because it assumes "JSON always
starts with two ASCII characters", relying on RFC 4627, i.e. continuing
to rule out e.g.

  "Ѐкземпияр"

To accommodate this case, we would need to add

 1
  -- 00 -- --   UTF-16 (LE)
 2
  -- 00 00 --   UTF-16 (LE)

to the requests algorithm.
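
A minimal Python sketch of the zero-counting step with those two rows added
is shown below.  The function name guess_by_nulls is hypothetical and BOM
handling is left out to keep the example short.  The sketch implements only
the two little-endian rows given above; the big-endian counterparts (for
example 00 22 04 00) are still reported as an error.

```python
_NUL = b"\x00"


def guess_by_nulls(data: bytes):
    """Guess the encoding of BOM-less JSON; return None on error."""
    sample = data[:4]
    nulls = sample.count(_NUL)
    if nulls == 0:
        return "utf-8"
    if nulls == 1:
        if sample[1:2] == _NUL:        # -- 00 -- --  (added row)
            return "utf-16-le"
    if nulls == 2:
        if sample[0::2] == _NUL * 2:   # 00 -- 00 --
            return "utf-16-be"
        if sample[1::2] == _NUL * 2:   # -- 00 -- 00
            return "utf-16-le"
        if sample[1:3] == _NUL * 2:    # -- 00 00 --  (added row)
            return "utf-16-le"
    if nulls == 3:
        if sample[:3] == _NUL * 3:     # 00 00 00 --
            return "utf-32-be"
        if sample[1:] == _NUL * 3:     # -- 00 00 00
            return "utf-32-le"
    return None
```

The first added row catches a quote followed by an ordinary non-ASCII
character such as U+043A; the second catches a quote followed by a U+xx00
character such as U+0400.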

(There are, it has to be said, few Unicode characters whose UTF-16LE
form is 00xx, i.e. U+xx00, the first code point on a code page -- I
had to hunt pretty hard to find the above specimen, which is in fact a
slight cheat :-) Many code pages have a gap at the 00 point.  Not sure
about the status of U+4E00, one variant of the ideograph for the
numeral 1).
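
Anyone who wants to check which U+xx00 code points are actually assigned can
ask the Unicode database bundled with Python.  The sketch below is only an
illustration; the function name page_start_characters is made up, and the
result depends on the Unicode version your Python ships with, so no counts
are quoted here.

```python
import unicodedata


def page_start_characters():
    """Yield (code point, name) for named BMP characters of the form U+xx00."""
    for high in range(0x01, 0x100):
        cp = high << 8
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points are not characters
        # name() returns the default (None) for unassigned, private-use
        # and control code points.
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            yield cp, name


if __name__ == "__main__":
    for cp, name in page_start_characters():
        print(f"U+{cp:04X}  {name}")
```

On current Python versions the listing includes U+0400, U+2C00 and U+4E00.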

ht

[1] http://docs.python-requests.org/en/latest/
-- 
       Henry S. Thompson, School of Informatics, University of Edinburgh
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND    (44) 131 650-4440
                Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
                       URL: http://www.ltg.ed.ac.uk/~ht/
 [mail from me _always_ has a .sig like this    mail without it is forged 
spam]
Received on Sunday, 17 November 2013 10:01:45 UTC