
[whatwg] Internal character encoding declaration, Drop UTF-32, and UTF and BOM terminology

From: Michael A. Puls II <shadow2531@gmail.com>
Date: Sat, 23 Jun 2007 20:41:05 -0400
Message-ID: <6b9c91b20706231741m62a6480clf1237849530a3011@mail.gmail.com>
> On Sat, 11 Mar 2006, Henri Sivonen wrote:
> > The encoding labels with LE or BE in them mean BOMless variants where
> > the encoding label on the transfer protocol level gives the endianness.
> > (See http://www.ietf.org/rfc/rfc2781.txt) When the spec refers to UTF-16
> > with BOM in a particular endianness, I think the spec should use
> > "big-endian UTF-16" and "little-endian UTF-16".
> >
> > Since declaring endianness on the transfer protocol level has no benefit
> > over using the BOM when the label is right and there's a chance to get
> > the label wrong, the encoding labels with explicit endianness are
> > harmful for interchange. In my opinion, the spec should avoid giving
> > authors any bad ideas by reinforcing these labels by repetition.

FWIW, after reading the labeling part of the RFC again and adding your
suggestion, I came up with this:

big-endian UTF-16 = The big-endian encoding of UTF-16 with the BOM (bytes FE FF)
little-endian UTF-16 = The little-endian encoding of UTF-16 with the BOM (bytes FF FE)
UTF-16BE = The big-endian encoding of UTF-16 without the BOM
UTF-16LE = The little-endian encoding of UTF-16 without the BOM
UTF-16 = big-endian UTF-16 or little-endian UTF-16, or fallback to UTF-16BE if there is no BOM
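
As a rough illustration of how a decoder might apply the definitions above,
here is a minimal Python sketch. The function name decode_utf16 and the exact
label matching are my own assumptions, not anything from the spec or the RFC;
only the codecs constants and the "utf-16-be"/"utf-16-le" codec names come
from Python's standard library.

```python
import codecs

BOM_BE = codecs.BOM_UTF16_BE  # bytes FE FF
BOM_LE = codecs.BOM_UTF16_LE  # bytes FF FE


def decode_utf16(data: bytes, label: str) -> str:
    """Decode bytes according to a UTF-16 family label (hypothetical helper)."""
    label = label.strip().lower()
    if label == "utf-16be":
        # Endianness comes from the label; no BOM is expected. A leading
        # FE FF is kept in the output as the character U+FEFF.
        return data.decode("utf-16-be")
    if label == "utf-16le":
        return data.decode("utf-16-le")
    if label == "utf-16":
        # Endianness comes from the BOM, which is stripped from the output.
        if data.startswith(BOM_BE):
            return data[len(BOM_BE):].decode("utf-16-be")
        if data.startswith(BOM_LE):
            return data[len(BOM_LE):].decode("utf-16-le")
        # No BOM: fall back to big-endian.
        return data.decode("utf-16-be")
    raise ValueError("unsupported label: " + label)


if __name__ == "__main__":
    print(decode_utf16(b"\xff\xfeA\x00", "UTF-16"))
    print(decode_utf16(b"\x00A", "UTF-16"))
```

Note that with the "UTF-16BE" and "UTF-16LE" labels the sketch never sniffs for
a BOM, which is the case where a wrong label silently produces garbage.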

-- 
Michael
Received on Saturday, 23 June 2007 17:41:05 UTC
