Re: [Json] BOMs (Was: Re: JSON: remove gap between Ecma-404 and IETF draft)

From: Pete Cordell <petejson@codalogic.com> · Date: Mon, 18 Nov 2013 16:08:53 -0000

----- Original Message ----- From: "Tim Bray" <tbray@textuality.com>
> This feels backward, because BOMs are actually useful for UTF-16 and
> UTF-32, but essentially useless for UTF-8.

Not useless if you're trying to tell the difference between a hand-edited
Windows cp-1252 (or whatever it's called) encoded text file and a UTF-8
encoded text file.

I don't think we need them for any other reason, but I think some 
international Windows users would be thankful if you allowed them for that 
case.
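
As a rough illustration of that case, here is a minimal Python sketch of the
"BOM present means UTF-8, otherwise assume the legacy code page" rule. The
function name decode_text and the default of cp1252 are assumptions made for
the example; they do not come from any spec or from an existing parser.

```python
import codecs


def decode_text(data: bytes, legacy: str = "cp1252") -> str:
    """Decode file bytes, using a UTF-8 BOM as the only encoding hint.

    Hypothetical helper: the name and the cp1252 default are assumptions
    for illustration, not part of any JSON or Windows specification.
    """
    if data.startswith(codecs.BOM_UTF8):
        # The BOM (EF BB BF) marks the file as UTF-8; drop it before decoding
        # so it does not show up as U+FEFF in the text.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # No signature: fall back to the legacy system encoding.
    return data.decode(legacy)


if __name__ == "__main__":
    with_bom = codecs.BOM_UTF8 + '{"name": "é"}'.encode("utf-8")
    without_bom = '{"name": "é"}'.encode("cp1252")
    print(decode_text(with_bom))
    print(decode_text(without_bom))
```

Without the BOM, the UTF-8 bytes for a non-ASCII character would be read as
two cp1252 characters, which is the confusion the BOM avoids.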

On Mon, Nov 18, 2013 at 2:05 AM, Pete Cordell <petejson@codalogic.com> wrote:

> Given the history below, would it be sensible to accept BOMs for UTF-8
> encoding, but not for UTF-16 and UTF-32?  In other words, are BOMs needed
> and/or used in the wild for UTF-16 and UTF-32?
>
> Maybe the text can say something like "SHOULD accept BOMs for UTF-8, and
> MAY accept BOMs for UTF-16 and / or UTF-32"?
>
> Thanks,
>
> Pete Cordell
> Codalogic Ltd
> C++ tools for C++ programmers, http://codalogic.com
> Read & write XML in C++, http://www.xml2cpp.com
> ----- Original Message ----- From: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
> To: "Henry S. Thompson" <ht@inf.ed.ac.uk>
> Cc: "John Cowan" <cowan@mercury.ccil.org>; "IETF Discussion"
> <ietf@ietf.org>; "Paul Hoffman" <paul.hoffman@vpnc.org>; "JSON WG"
> <json@ietf.org>; "Joe Hildebrand (jhildebr)" <jhildebr@cisco.com>;
> "Anne van Kesteren" <annevk@annevk.nl>; <www-tag@w3.org>; "es-discuss"
> <es-discuss@mozilla.org>
> Sent: Thursday, November 14, 2013 11:14 AM
> Subject: Re: [Json] JSON: remove gap between Ecma-404 and IETF draft
>
>
>  Hello Henry, others,
>>
>> On 2013/11/14 18:44, Henry S. Thompson wrote:
>>
>>> John Cowan writes:
>>>
>>>  Joe Hildebrand (jhildebr) scripsit:
>>>>
>>>>  If 404 doesn't allow [a BOM], I don't see a strong need to add it.
>>>>> Parsers can always be more forgiving of what they will parse than what
>>>>> the spec says, particularly since section 9 says "A JSON parser MAY
>>>>> accept non-JSON forms or extensions".
>>>>>
>>>>
>>>> It's not clear that 404 disallows it, since 404 is defined in terms of
>>>> characters, and a BOM is not a character but an out-of-band signal.
>>>>
>>>
>>> I think this is a crucial observation.
>>>
>>
>> Yes, and I think it's based on the experience with XML. But while this
>> experience may be applicable to JSON, Anne's original comment about the
>> BOM and XMLHttpRequest suggests that 404 actually currently does not
>> tolerate a BOM, and that implementations (except for XMLHttpRequest) also
>> don't.
>>
>> To give some historic background, the BOM for UTF-8 wasn't in the first
>> edition of XML (http://www.w3.org/TR/1998/REC-xml-19980210#sec-guessing).
>> It only came in later because Microsoft used it in Notepad to be able to
>> quickly distinguish between UTF-8 and the legacy system encoding. Because
>> many people were writing some XML by hand, and some of them were using
>> Notepad, the pressure on XML to accept a BOM at the start of a UTF-8 file
>> mounted, and it was included in the second edition of the XML
>> Recommendation (http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing).
>>
>> Compared to XML, JSON may be much less edited by hand, or much less edited
>> in Notepad, or otherwise just have a different history from XML, but we
>> have to make sure.
>>
>> Regards,   Martin.
>>
>>
>>  I note that XML approaches
>>> this problem in what might be a useful way.  The XML ABNF makes no
>>> mention of BOM, it's not part of any XML document as such.  But it
>>> _is_ allowed.  The relevant wording [1] is:
>>>
>>>    Entities ... may begin with the Byte Order Mark described by Annex H
>>>    of [ISO/IEC 10646:2000], section 16.8 of [Unicode] (the ZERO WIDTH
>>>    NO-BREAK SPACE character, #xFEFF). _This is an encoding signature,_
>>>    _not part of either the markup or the character data of the XML_
>>>    _document._ XML processors must be able to use this character to
>>>    differentiate between UTF-8 and UTF-16 encoded documents. [emphasis
>>>    added]
>>>
>>> ht
>>>
>>> [1] http://www.w3.org/TR/REC-xml/#charencoding
>>>
>> _______________________________________________
>> json mailing list
>> json@ietf.org
>> https://www.ietf.org/mailman/listinfo/json
>>
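
The "SHOULD accept BOMs for UTF-8, and MAY accept BOMs for UTF-16 and / or
UTF-32" wording quoted above, combined with the XML view of the BOM as an
encoding signature rather than part of the document, could look roughly like
the Python sketch below. The function name loads_tolerating_bom and the
allow_wide_boms flag are hypothetical, chosen only for this example; the
fallback to UTF-8 when no BOM is present is likewise an assumption.

```python
import codecs
import json

# UTF-32 signatures are checked first because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def loads_tolerating_bom(data: bytes, allow_wide_boms: bool = True):
    """Parse JSON bytes, treating a leading BOM as an out-of-band signature.

    Hypothetical helper: always accepts a UTF-8 BOM, and accepts UTF-16 and
    UTF-32 BOMs only when allow_wide_boms is true.
    """
    for bom, codec in _SIGNATURES:
        if data.startswith(bom):
            if codec != "utf-8-sig" and not allow_wide_boms:
                raise ValueError("UTF-16/UTF-32 BOM not accepted")
            # These codecs consume the BOM, so the parser never sees U+FEFF.
            return json.loads(data.decode(codec))
    # Assumption: input without a BOM is UTF-8.
    return json.loads(data.decode("utf-8"))


if __name__ == "__main__":
    doc = '{"a": 1}'
    print(loads_tolerating_bom(codecs.BOM_UTF8 + doc.encode("utf-8")))
    print(loads_tolerating_bom(codecs.BOM_UTF16_LE + doc.encode("utf-16-le")))
```

The JSON grammar itself is untouched here: the signature is stripped during
decoding, before any character reaches the parser.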