Re: BOMs from Pete Cordell on 2013-11-18 (www-tag@w3.org from November 2013)

From: Pete Cordell <petejson@codalogic.com>
Date: Mon, 18 Nov 2013 13:36:13 -0000
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, "Henry S. Thompson" <ht@inf.ed.ac.uk>
Cc: "John Cowan" <cowan@mercury.ccil.org>, "IETF Discussion" <ietf@ietf.org>, "JSON WG" <json@ietf.org>, "Anne van Kesteren" <annevk@annevk.nl>, <www-tag@w3.org>, "es-discuss" <es-discuss@mozilla.org>
Message-ID: <F8C2334E1B3B4A63875ECFCD151726CC@codalogic>

----- Original Message ----- 
From: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
> On 2013/11/18 20:11, Henry S. Thompson wrote:
>> Pete Cordell writes:
>>
>>> Given the history below, would it be sensible to accept BOMs for UTF-8
>>> encoding, but not for UTF-16 and UTF-32?  In other words, are BOMs
>>> needed and/or used in the wild for UTF-16 and UTF-32?
>>>
>>> Maybe the text can say something like "SHOULD accept BOMs for UTF-8,
>>> and MAY accept BOMs for UTF-16 and / or UTF-32"?
>>
>> My sense is that you'll see more UTF-16 BOMs than anything else.
>
> Yes indeed. BOM means Byte Order Mark. It's crucial for over-the-wire 
> UTF-16. (It's irrelevant for in-memory UTF-16, but that's not what we are 
> discussing.)

The in-memory case is not entirely irrelevant, because a number of JSON 
messages will be constructed in memory and then written straight to the wire.

I did a little experiment with Visual Studio.  It will allow me to save in 
UTF-8 with or without a BOM (or BOM-like signature).  Saving in UTF-16 (or 
was it UCS-2?) always writes a BOM.  There didn't seem to be a UTF-32 option.
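
As a rough illustration of what those save options put at the front of a 
file, here is a small Python sketch.  Python and the sample text are 
assumptions made for illustration, not anything used in the experiment 
above; the BOM constants come from Python's standard codecs module.

```python
import codecs

# Sample JSON text; purely illustrative.
text = '{"a":1}'

# Plain UTF-8 has no BOM. "utf-8-sig" is Python's name for UTF-8
# preceded by the three-byte signature EF BB BF.
utf8_plain = text.encode("utf-8")
utf8_bom = text.encode("utf-8-sig")

# The explicit "-le"/"-be" codecs write no BOM, so prepend one by hand
# to show both byte orders regardless of the platform this runs on.
utf16_le_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
utf16_be_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")


def first_bytes(data, count=4):
    """Return the first few bytes as space-separated hex."""
    return " ".join(f"{b:02x}" for b in data[:count])


for label, data in [
    ("UTF-8, no BOM", utf8_plain),
    ("UTF-8 with BOM", utf8_bom),
    ("UTF-16LE with BOM", utf16_le_bom),
    ("UTF-16BE with BOM", utf16_be_bom),
]:
    print(f"{label:18} {first_bytes(data)}")
```

The two UTF-16 lines differ in the order of every byte pair, which is why 
the BOM matters there, while the UTF-8 signature carries no byte-order 
information at all.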

JSON doesn't need BOMs.  However, there are cases where people might 
hand-edit messages, and if they choose to save in UTF-16 the file will 
likely start with a BOM.

Is it acceptable to tell people not to save hand-edited files in UTF-16, 
suggesting UTF-8 (possibly with an encoded BOM) as an alternative?

I would imagine that if someone did have a hand-edited UTF-8 file on 
Windows then the allowance of a BOM would help their sanity immeasurably, 
but it's not something I have firsthand knowledge of.
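
To make "allowing a BOM" concrete, here is a minimal sketch of a tolerant 
receiver, again in Python as an illustrative choice.  The function name 
loads_tolerating_bom is hypothetical, and treating BOM-less input as UTF-8 
is an assumption of this sketch rather than something settled in this 
thread.

```python
import codecs
import json

# Longer BOMs are checked first: the UTF-32LE BOM (FF FE 00 00) begins
# with the UTF-16LE BOM (FF FE), so the order of this list matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def loads_tolerating_bom(data: bytes):
    """Parse JSON bytes, dropping a leading BOM if one is present.

    Hypothetical helper. Input without a BOM is assumed to be UTF-8,
    which is an assumption of this sketch.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # The BOM names the encoding; strip it before decoding.
            return json.loads(data[len(bom):].decode(encoding))
    return json.loads(data.decode("utf-8"))


# A hand-edited file saved as "UTF-8 with BOM" parses the same as one
# saved without it.
with_bom = codecs.BOM_UTF8 + b'{"name": "example"}'
without_bom = b'{"name": "example"}'
assert loads_tolerating_bom(with_bom) == loads_tolerating_bom(without_bom)
```

A sender following the "JSON doesn't need BOMs" line would never emit one; 
the tolerance lives only on the receiving side.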

I believe Unix/Linux works with UTF-8 without BOMs.  Is this the case?

Pete Cordell
Codalogic Ltd
C++ tools for C++ programmers, http://codalogic.com
Read & write XML in C++, http://www.xml2cpp.com

Received on Monday, 18 November 2013 13:36:21 UTC