Feedback about the BOM article from Henri Sivonen on 2012-12-10 (www-international@w3.org from October to December 2012)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Mon, 10 Dec 2012 18:16:11 +0200
To: www-international@w3.org
Message-ID: <CAJQvAufc46n55yremxjz58Xom2hGS0_N=eTcwocjxxuUahXjHA@mail.gmail.com>
Feedback about http://www.w3.org/International/questions/new/qa-byte-order-mark-new

- -

The “What is” intro should probably mention UTF-16 in order to explain
where the “byte order” part of the name comes from. However, I think
it would be best to frame this as etymology arising from a legacy
encoding. That is, I think it should be made clear from the start that
UTF-16 should not be used for interchange. Something like:

“Before UTF-8 was introduced in early 1993, the expected way of
transferring Unicode text was as 16-bit code units in an encoding
called UCS-2, which was later extended to UTF-16. 16-bit code units
can be expressed as bytes in two ways: the most significant byte
first (big endian) or the least significant byte first (little
endian). To communicate which byte order was in use, U+FEFF (the
code point for ZERO WIDTH NON-BREAKING SPACE) was written at the
start of the stream as a magic number that is not logically part of
the text the stream represents.

Even though UTF-8 proved to be a superior way of interchanging Unicode
text and UTF-8 didn't pose the issue of alternative byte orders,
U+FEFF can still be encoded as UTF-8 (resulting in the bytes 0xEF,
0xBB, 0xBF) at the start of the stream in order to give UTF-8 a
recognizable magic number (encoding signature).”
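
The byte sequences described in the proposed text can be checked with a
minimal Python sketch. The helper name sniff_bom is hypothetical and is
introduced only for illustration; it assumes that only UTF-8 and the two
UTF-16 byte orders need to be told apart.

```python
import codecs
from typing import Optional

BOM = "\ufeff"  # U+FEFF, the code point used as the byte order mark

# The same code point serialized three ways.
utf16_be = BOM.encode("utf-16-be")  # most significant byte first
utf16_le = BOM.encode("utf-16-le")  # least significant byte first
utf8_sig = BOM.encode("utf-8")      # no byte-order issue, just a signature


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if absent.

    Illustrative only: UTF-32 and BOM-less content are not handled.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    return None
```

Here utf16_be is the bytes 0xFE 0xFF, utf16_le is 0xFF 0xFE, and utf8_sig
is 0xEF 0xBB 0xBF, matching the values quoted above.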

- -

“since it is impossible to override manually”

This is currently untrue in Firefox and Opera at least.

- -

“However, bear in mind that it is always a good idea to declare the
encoding of your page using the meta element, in addition to the BOM,
so that the encoding is apparent to people visually inspecting the
file. ”

I disagree: Either the <meta> declaration is redundant or it is wrong
and misleads a person who is inspecting the file.

- -

“If you change the encoding of a UTF-8 file from a Unicode encoding to
a non-Unicode encoding, you must ensure that the BOM is removed.”

This should remark that you should never want to change the encoding
away from UTF-8, so this is a non-issue in that sense. :-)

- -

“If a page is originally in a Unicode encoding and the transcoder
switches the encoding to something else, such as Latin1, it will
usually indicate the new encoding by changing the information in the
HTTP header. The transcoder will typically not remove the byte-order
mark.”

[citation needed]

- -

“In Internet Explorer 5.5 a BOM at the start of a file will cause the
page to be rendered in quirks mode”

IE 5.5 had only quirks mode. The first IE for Windows to introduce a
non-quirks mode was IE6. And in any case, it’s silly to give advice
about IE 5.5 in this day and age.

As for interference in later IE, [citation needed].

- -

“A UTF-8 signature at the beginning of a CSS file can sometimes cause
the initial rules in the file to fail on certain user agents.”

[citation needed]

- -

“Note that, for HTML it's recommended that you use UTF-8 and that you
avoid UTF-16.”

To drive this point home, maybe mention that serving user-supplied
content as UTF-16 is an XSS risk:
http://hsivonen.iki.fi/test/moz/never-show-user-supplied-content-as-utf-16.htm

(Sure, browsers should disable the encoding menu to mitigate that
attack, but for the time being, the attack is possible.)
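
A minimal Python sketch of the general mechanism behind this kind of
risk, not a reproduction of the linked test page. It assumes big-endian
UTF-16 without a BOM and a reader whose browser is switched to an
ASCII-compatible encoding (Latin-1 here); the variable names are
hypothetical.

```python
import html

# Bytes that spell markup in any ASCII-compatible encoding.
payload_bytes = b"<script>"

# Read as UTF-16BE, the same eight bytes are four CJK characters,
# none of which is "<" or ">".
as_utf16 = payload_bytes.decode("utf-16-be")

# Escaping is therefore a no-op: there is nothing for it to escape.
escaped = html.escape(as_utf16)

# The server emits the "safe" text as UTF-16BE ...
served = escaped.encode("utf-16-be")

# ... but if the bytes are reinterpreted as Latin-1, markup reappears.
reinterpreted = served.decode("latin-1")
```

Under these assumptions the escaped text passes through unchanged, and
the reinterpreted bytes read as the original markup.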

- -

“The use of UTF-32 for HTML content, however, is strongly discouraged
and some implementations are removing support for it, so we haven't
even mentioned it until now.”

Have removed support, rather.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Monday, 10 December 2012 16:16:41 UTC