- From: Mihai Niță <mnita@google.com>
- Date: Mon, 18 Nov 2013 08:44:09 -0800
- To: derhoermi@gmx.net, ht@inf.ed.ac.uk
- Cc: IETF Discussion <ietf@ietf.org>, JSON WG <json@ietf.org>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, www-tag@w3.org, es-discuss <es-discuss@mozilla.org>
- Message-ID: <CAKj9SuMeMorkQn88QwqvcE9D8b8ss4xxkd4dLqAOLWjydRTPRw@mail.gmail.com>
I would add my two cents here.

*Where the precise type of the data stream is known (e.g. Unicode
big-endian or Unicode little-endian), the BOM should not be used.*
From http://www.unicode.org/faq/utf_bom.html#bom1

And there is something in RFC 4627 that tells me JSON is not BOM-aware:

==================
JSON text SHALL be encoded in Unicode.  The default encoding is UTF-8.

Since the first two characters of a JSON text will always be ASCII
characters [RFC0020], it is possible to determine whether an octet
stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
at the pattern of nulls in the first four octets.

   00 00 00 xx  UTF-32BE
   00 xx 00 xx  UTF-16BE
   xx 00 00 00  UTF-32LE
   xx 00 xx 00  UTF-16LE
   xx xx xx xx  UTF-8
==================

These patterns are not BOMs, otherwise they would be something like this:

   00 00 FE FF  UTF-32BE
   FE FF xx xx  UTF-16BE
   FF FE 00 00  UTF-32LE
   FF FE xx xx  UTF-16LE
   EF BB BF xx  UTF-8

It is kind of unfortunate that "the precise type of the data stream" is not
determined, and a BOM is not accepted.

But a mechanism to decide the encoding is specified in the RFC, and it does
not include a BOM. In fact it prevents the use of a BOM (00 00 FE FF does
not match the 00 00 00 xx pattern, for instance).

So, "by the RFC", a BOM is not expected / understood.

-----

Although I am afraid that the RFC has a problem:
I think "日本語" (U+0022 U+65E5 U+672C U+8A9E U+0022) is valid JSON (same as
"foo").

The first four bytes are:

   00 00 00 22  UTF-32BE
   00 22 65 E5  UTF-16BE
   22 00 00 00  UTF-32LE
   22 00 E5 65  UTF-16LE
   22 E6 97 A5  UTF-8

The UTF-16 bytes don't match the patterns in the RFC, so UTF-16 streams would
(wrongly) be detected as UTF-8, if one strictly follows the RFC.

Regards,
Mihai

======================================================
From: Bjoern Hoehrmann <derhoermi@gmx.net>
To: ht@inf.ed.ac.uk (Henry S. Thompson)
Cc: IETF Discussion <ietf@ietf.org>, JSON WG <json@ietf.org>,
    "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, www-tag@w3.org,
    es-discuss <es-discuss@mozilla.org>
Date: Mon, 18 Nov 2013 14:48:19 +0100
Subject: Re: BOMs

* Henry S. Thompson wrote:
>I'm curious to know what level you're invoking the parser at.  As
>implied by my previous post about the Python 'requests' package, it
>handles application/json resources by stripping any initial BOM it
>finds -- you can try this with
>
>>>> import requests
>>>> r=requests.get("http://www.ltg.ed.ac.uk/ov-test/b16le.json")
>>>> r.json()

The Perl code was

  perl -MJSON -MEncode -e
    "my $s = encode_utf8(chr 0xFEFF) . '[]'; JSON->new->decode($s)"

The Python code was

  import json
  json.loads(u"\uFEFF[]".encode('utf-8'))

The Go code was

  package main

  import "encoding/json"
  import "fmt"

  func main() {
    r := "\uFEFF[]"

    var f interface{}
    err := json.Unmarshal([]byte(r), &f)

    fmt.Println(err)
  }

In other words, always passing a UTF-8 encoded byte string to the byte
string parsing part of the JSON implementation. RFC 4627 is the only
specification for the application/json on-the-wire format and it does
not mention anything about Unicode signatures. Looking for certain byte
sequences at the beginning and treating them as a Unicode signature is
the same as looking for `/* ... */` and treating it as a comment.
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
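
For illustration, the RFC 4627 null-pattern check quoted in this thread
can be sketched in Python. This is a minimal sketch and not code from
either message: the function name detect_rfc4627_encoding and the choice
to return None when no row of the table matches (or when fewer than four
octets are available) are assumptions made here.

```python
import codecs

# Null/non-null patterns for the first four octets, as listed in RFC 4627
# section 3. True means "octet is 0x00", False means "octet is non-zero".
_PATTERNS = {
    (True, True, True, False): "utf-32-be",
    (True, False, True, False): "utf-16-be",
    (False, True, True, True): "utf-32-le",
    (False, True, False, True): "utf-16-le",
    (False, False, False, False): "utf-8",
}


def detect_rfc4627_encoding(octets):
    """Return the encoding implied by the RFC 4627 table, or None.

    Hypothetical helper: None means that no row of the table matched,
    or that fewer than four octets were supplied (an assumption, since
    the table itself only describes four-octet prefixes).
    """
    if len(octets) < 4:
        return None
    pattern = tuple(b == 0 for b in octets[:4])
    return _PATTERNS.get(pattern)


if __name__ == "__main__":
    samples = {
        "foo, UTF-16BE": '"foo"'.encode("utf-16-be"),
        "日本語, UTF-16BE": '"日本語"'.encode("utf-16-be"),
        "日本語, UTF-16LE": '"日本語"'.encode("utf-16-le"),
        "[] with BOM, UTF-32BE": codecs.BOM_UTF32_BE + "[]".encode("utf-32-be"),
    }
    for label, data in samples.items():
        print(label, data[:4].hex(" "), detect_rfc4627_encoding(data))
```

With this sketch, "日本語" in UTF-16BE starts with 00 22 65 E5 and matches
no row, so the function returns None; a caller that then falls back to
the default encoding would treat the stream as UTF-8, which is the
misdetection described above. A UTF-32BE stream starting with the BOM
00 00 FE FF likewise matches no row.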
Received on Tuesday, 19 November 2013 08:00:51 UTC