- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Mon, 10 Dec 2012 18:16:11 +0200
- To: www-international@w3.org
Feedback about http://www.w3.org/International/questions/new/qa-byte-order-mark-new

The “What is” intro should probably mention UTF-16 in order to explain where the “byte order” part of the name comes from. However, I think it would be best to frame this as etymology arising from a legacy encoding. That is, I think it should be made clear from the start that UTF-16 should not be used for interchange. Something like:

“Before UTF-8 was introduced in early 1993, the expected way of transferring Unicode text was as 16-bit code units, using an encoding called UCS-2, which was later extended to UTF-16. 16-bit code units can be expressed as bytes in two ways: the most significant byte first (big endian) or the least significant byte first (little endian). To communicate which byte order was in use, the stream was started by writing U+FEFF (the code point for ZERO WIDTH NON-BREAKING SPACE) at the start of the stream as a magic number that is not logically part of the text the stream represents. Even though UTF-8 proved to be a superior way of interchanging Unicode text and UTF-8 didn't pose the issue of alternative byte orders, U+FEFF can still be encoded as UTF-8 (resulting in the bytes 0xEF, 0xBB, 0xBF) at the start of the stream in order to give UTF-8 a recognizable magic number (encoding signature).”

“since it is impossible to override manually”

This is currently untrue in Firefox and Opera at least.

“However, bear in mind that it is always a good idea to declare the encoding of your page using the meta element, in addition to the BOM, so that the encoding is apparent to people visually inspecting the file.”

I disagree: Either the <meta> declaration is redundant or it is wrong and misleads a person who is inspecting the file.
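To make the byte sequences in the suggested intro text concrete, here is a minimal Python sketch. It is only an illustration, not part of the article under review: the helper name `sniff_bom` is made up for this example, and it deliberately ignores UTF-32 (whose little-endian signature begins with the same two bytes as the UTF-16 little-endian one).

```python
import codecs

# U+FEFF, the code point written at the start of the stream.
BOM = "\ufeff"

# The same code point serialized in the three encodings discussed.
utf16_be = BOM.encode("utf-16-be")  # most significant byte first
utf16_le = BOM.encode("utf-16-le")  # least significant byte first
utf8 = BOM.encode("utf-8")          # no byte-order ambiguity, just a signature


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none.

    Hypothetical helper for illustration; UTF-32 is intentionally not handled.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    return None


if __name__ == "__main__":
    for label, raw in (("UTF-16BE", utf16_be), ("UTF-16LE", utf16_le), ("UTF-8", utf8)):
        print(label, raw.hex(" "))
```

Run as a script, this prints the hexadecimal bytes of each serialization, which can be compared against the values quoted in the text.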
“If you change the encoding of a UTF-8 file from a Unicode encoding to a non-Unicode encoding, you must ensure that the BOM is removed.”

This should remark that you should never want to change the encoding away from UTF-8, so this is a non-issue in that sense. :-)

“If a page is originally in a Unicode encoding and the transcoder switches the encoding to something else, such as Latin1, it will usually indicate the new encoding by changing the information in the HTTP header. The transcoder will typically not remove the byte-order mark.”

[citation needed]

“In Internet Explorer 5.5 a BOM at the start of a file will cause the page to be rendered in quirks mode”

IE 5.5 only had the quirks mode. The first IE for Windows that introduced a non-quirks mode was IE6. And in any case, it’s silly to give advice about IE 5.5 in this day and age. As for interference in later IE, [citation needed].

“A UTF-8 signature at the beginning of a CSS file can sometimes cause the initial rules in the file to fail on certain user agents.”

[citation-needed]

“Note that, for HTML it's recommended that you use UTF-8 and that you avoid UTF-16.”

To drive this point home, maybe mention that serving user-supplied content as UTF-16 is an XSS risk: http://hsivonen.iki.fi/test/moz/never-show-user-supplied-content-as-utf-16.htm (Sure, browsers should disable the encoding menu to mitigate that attack, but for the time being, the attack is possible.)

“The use of UTF-32 for HTML content, however, is strongly discouraged and some implementations are removing support for it, so we haven't even mentioned it until now.”

Have removed support, rather.

--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Monday, 10 December 2012 16:16:41 UTC