- From: Henry S. Thompson <ht@inf.ed.ac.uk>
- Date: Thu, 14 Nov 2013 17:17:23 +0000
- To: "Joe Hildebrand (jhildebr)" <jhildebr@cisco.com>
- Cc: Pete Cordell <petejson@codalogic.com>, Paul Hoffman <paul.hoffman@vpnc.org>, "www-tag@w3.org" <www-tag@w3.org>, JSON WG <json@ietf.org>
Joe Hildebrand (jhildebr) writes:

> On 11/14/13 5:04 AM, "Pete Cordell" <petejson@codalogic.com> wrote:
>
>> In http://www.ietf.org/mail-archive/web/json/current/msg00565.html I
>> mentioned that we also need to allow for characters such as U+2c00 to be
>> the first character in a quoted string.
>
> Ah, yes.  Sorry, I quoted from the wrong part of the conversation.  I
> completely agree.
>
>> That can be reduced a bit if we use "--" to indicate "not-tested":
>>
>>   00 00 -- --  UTF-32BE
>>   00 xx -- --  UTF-16BE
>>   xx 00 00 00  UTF-32LE
>>   xx 00 00 xx  UTF-16LE
>>   xx 00 xx --  UTF-16LE
>>   xx xx -- --  UTF-8
>
> +1 to this table.  It's clear, correct, and implementable.

Doesn't work if you want to allow a pre-document BOM.  But I think the
following enlarged grid does work, where "--" now means "not any
value(s) explicitly tested for with an 'earlier' shared prefix":

   00 00 FE FF  UTF-32 (BE) w. BOM
  [00 00 FF FE  UCS-4 w. BOM, unusual octet order (2143)]
   00 00 -- --  UTF-32BE
   00 xx -- --  UTF-16BE
   FF FE 00 00  UTF-32 (LE) w. BOM
  [FE FF 00 00  UCS-4 w. BOM, unusual octet order (3412)]
   FE FF -- --  UTF-16 (BE) w. BOM
   FF FE -- --  UTF-16 (LE) w. BOM
   xx 00 00 00  UTF-32LE
   xx 00 00 xx  UTF-16LE  [Note that the following algorithm disagrees here]
   xx 00 xx --  UTF-16LE
   EF BB BF --  UTF-8 w. BOM
   xx xx -- --  UTF-8

A possibly simpler algorithm, which has the same outcome I think,
minus the unusual octet cases and the exception noted above, is used
by the Python requests package [1] for JSON charset detection.
Schematically, this works as follows:

   FF FE 00 00  UTF-32 (LE) w. BOM
   00 00 FE FF  UTF-32 (BE) w. BOM
   EF BB BF     UTF-8 w. BOM
   FE FF        UTF-16 (BE) w. BOM
   FF FE        UTF-16 (LE) w. BOM

   Now count 00 bytes in first 4:

   0               UTF-8
   2  00 -- 00 --  UTF-16 (BE)
      -- 00 -- 00  UTF-16 (LE)
   3  00 00 00 --  UTF-32 (BE)
      -- 00 00 00  UTF-32 (LE)
   error

I _think_ requests is only correct because it assumes "JSON always
starts with two ASCII characters", depending on 4627, i.e. continuing
to rule out e.g.

   "Ѐкземпияр"

To accommodate this case, we would need to add

   1  -- 00 -- --  UTF-16 (LE)
   2  -- 00 00 --  UTF-16 (LE)

to the requests algorithm.

(There are, it has to be said, few Unicode characters whose UTF-16-L
form is 00xx, i.e. U+xx00, the first code point on a code page -- I had
to hunt pretty hard to find the above specimen, which is in fact a
slight cheat :-)  Many code pages have a gap at the 00 point.  Not sure
about the status of U+4E00, one variant of the ideograph for the
numeral 1).

ht

[1] http://docs.python-requests.org/en/latest/
-- 
       Henry S. Thompson, School of Informatics, University of Edinburgh
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND (44) 131 650-4440
                Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
                       URL: http://www.ltg.ed.ac.uk/~ht/
 [mail from me _always_ has a .sig like this mail without it is forged spam]
Received on Thursday, 14 November 2013 17:18:11 UTC
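
For illustration, the requests-style scheme described in the message (BOM checks first, then a count of 00 octets in the first four), together with the two extra UTF-16 (LE) rows it proposes, might be sketched in Python roughly as follows. The function name guess_json_encoding is hypothetical, the returned strings are ordinary Python codec names, and the handling of inputs shorter than four octets is an assumption; this is not the code of the requests package itself.

```python
import codecs


def guess_json_encoding(data: bytes) -> str:
    """Guess a Python codec name from the first four octets of a JSON text.

    Hypothetical helper: a sketch of the scheme in the message, not the
    actual requests implementation.
    """
    sample = data[:4]

    # BOM checks. The UTF-32 BOMs are tested first because FF FE 00 00
    # also begins with the UTF-16 (LE) BOM FF FE.
    if sample.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"      # this codec reads the BOM to pick the byte order
    if sample.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # this codec skips the BOM
    if sample.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"      # this codec reads the BOM to pick the byte order

    # No BOM: count 00 octets in the first four.
    nulls = sample.count(0)
    if nulls == 0:
        return "utf-8"
    if len(sample) < 4:
        # Assumption: the patterns below need four octets to be meaningful.
        raise ValueError("need at least 4 octets to classify")

    z0, z1, z2, z3 = (octet == 0 for octet in sample)

    if nulls == 1 and z1:
        return "utf-16-le"   # added row: -- 00 -- --
    if nulls == 2:
        if z0 and z2:
            return "utf-16-be"   # 00 -- 00 --
        if z1 and z3:
            return "utf-16-le"   # -- 00 -- 00
        if z1 and z2:
            return "utf-16-le"   # added row: -- 00 00 --
    if nulls == 3:
        if not z3:
            return "utf-32-be"   # 00 00 00 --
        if not z0:
            return "utf-32-le"   # -- 00 00 00

    # Anything else, including patterns the message does not list.
    raise ValueError("unrecognised octet pattern")
```

A caller would then decode with the guessed codec, for example data.decode(guess_json_encoding(data)), before handing the text to a JSON parser.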