Re: BOMs from Martin J. Dürst on 2013-11-19 (www-tag@w3.org from November 2013)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Tue, 19 Nov 2013 20:09:30 +0900
To: "t.p." <daedulus@btconnect.com>
CC: John Cowan <cowan@mercury.ccil.org>, IETF Discussion <ietf@ietf.org>, Pete Cordell <petejson@codalogic.com>, JSON WG <json@ietf.org>, Anne van Kesteren <annevk@annevk.nl>, www-tag@w3.org, es-discuss <es-discuss@mozilla.org>
Message-ID: <528B46EA.4040503@it.aoyama.ac.jp>

On 2013/11/19 19:10, t.p. wrote:
> ----- Original Message -----
> From: "Martin J. Dürst"<duerst@it.aoyama.ac.jp>

>> For UTF-8, the BOM is not a Byte Order Mark, because such a mark isn't
>> necessary at all. It may serve as a signature, but is not necessary, and
>> in some circumstances counterproductive.
>
> Martin
>
> We had a similar discussion with syslog back in 2005, the issue being
> that UTF-8 was new and different and how to tell whether it was being
> used or not, and what made it into RFC5424 was
> "  If a syslog application encodes MSG in UTF-8, the string MUST start
>     with the Unicode byte order mask (BOM), which for UTF-8 is ABNF
>     %xEF.BB.BF.  "
> which remains a MUST to this day.  There are no relevant Errata.
>
> Tom Petch

This is something that seems to have made quite a lot of sense for 
syslog. I can understand it if, before 2005, syslog was used with 
legacy encodings (iso-8859-1, Shift_JIS and similar) and there was 
otherwise no easy way to label the UTF-8 strings.
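
To illustrate the labeling role the BOM plays in the RFC 5424 text quoted 
above, here is a minimal sketch of a receiver. The function name and the 
iso-8859-1 fallback are assumptions for illustration only; nothing in the 
RFC prescribes a fallback encoding.

```python
import codecs


def decode_syslog_msg(msg: bytes, legacy_encoding: str = "iso-8859-1") -> str:
    """Decode a syslog MSG part (hypothetical helper, not from any library).

    If MSG starts with the UTF-8 signature %xEF.BB.BF, treat the rest as
    UTF-8. Otherwise fall back to an assumed legacy encoding.
    """
    if msg.startswith(codecs.BOM_UTF8):  # b"\xef\xbb\xbf"
        return msg[len(codecs.BOM_UTF8):].decode("utf-8")
    return msg.decode(legacy_encoding)
```

The signature is stripped before decoding, so the caller never sees 
U+FEFF in the resulting string.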

But another solution (for syslog, that is) would also have been 
possible. As John already pointed out, UTF-8 is very easy to detect 
heuristically: if a byte sequence follows the UTF-8 byte pattern, it is 
almost certainly UTF-8 and not something else. For more background, 
please see http://www.sw.it.aoyama.ac.jp/2012/pub/IUC11-UTF-8.pdf, where 
that idea came up first.
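
The heuristic can be sketched in a few lines. This is only an 
illustration: the function name and the three labels are mine, and it 
relies on Python's strict UTF-8 decoder to check the byte pattern. Pure 
ASCII is reported separately, since it matches the UTF-8 pattern trivially 
and says nothing about the encoding in use.

```python
def classify_bytes(data: bytes) -> str:
    """Return "ascii", "utf-8" or "other" (hypothetical labels)."""
    if all(b < 0x80 for b in data):
        # No byte has the high bit set: valid in UTF-8 and in most
        # legacy encodings alike, so no conclusion can be drawn.
        return "ascii"
    try:
        # The strict decoder rejects any sequence that breaks the
        # lead-byte / continuation-byte pattern of UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "other"
    return "utf-8"
```

Text in a legacy encoding that contains non-ASCII characters very rarely 
happens to form valid UTF-8 sequences, which is why the check works well 
in practice.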

As for JSON, it doesn't have the problem of legacy encodings. JSON by 
definition is encoded in a Unicode encoding form, and these forms are 
easy to distinguish because of the restrictions on character sequences 
in JSON. And this can be done without a BOM (or with a BOM).
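
As a rough sketch of how that distinction can be made: RFC 4627, section 
3, observes that a JSON text starts with two ASCII characters, so the 
pattern of zero octets in the first four bytes identifies the encoding 
form. The code below assumes that premise (a text starting with an object 
or array) and additionally checks for a BOM first. The function name and 
the returned labels are my own choices, not part of any specification.

```python
import codecs

# UTF-32 signatures are checked before UTF-16 ones, because the
# UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_json_encoding(data: bytes) -> str:
    """Guess the Unicode encoding form of a JSON text (hypothetical helper)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    head = data[:4]
    if len(head) == 4:
        # True where the octet is zero, e.g. 00 xx 00 xx for UTF-16BE.
        zeros = tuple(b == 0 for b in head)
        if zeros == (True, True, True, False):
            return "utf-32-be"
        if zeros == (False, True, True, True):
            return "utf-32-le"
        if zeros == (True, False, True, False):
            return "utf-16-be"
        if zeros == (False, True, False, True):
            return "utf-16-le"
    return "utf-8"
```

The function only reports the encoding; a caller that finds a BOM still 
has to skip it before parsing.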

What's most important now is to know what receivers actually accept. We 
are not in a design phase; we are just updating the definition of JSON 
and making sure we fix problems if there are any, but we have to use 
the installed base as the main guidance, not other protocols or 
formats.

Regards,   Martin.

Received on Tuesday, 19 November 2013 11:10:54 UTC