- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Thu, 22 Nov 2012 03:33:57 +0100
- To: www-international@w3.org
- Cc: Anne van Kesteren <annevk@annevk.nl>
Anne van Kesteren, Wed, 21 Nov 2012 22:04:22 +0100:
> I saw http://www.w3.org/International/questions/new/qa-byte-order-mark-new
> in the minutes.

I have no objections to Anne’s comments. It is especially important that the BOM overrides anything else. But instead of removing the warnings, perhaps you could say that, as of today, not all HTML UAs let the BOM override the HTTP charset declaration yet. Also, of course, one should not encourage anyone to make the BOM and HTTP disagree!

Here are some comments of my own:

(I.) While I often speak well of the BOM, I heard a good, critical comment from Martin on the Unicode mailing list this summer: [1]

"The problem with the BOM in UTF-8 is that it can be quite helpful (for quickly distinguishing between UTF-8 and legacy-encoded files) and quite damaging (for programs that use the Unix/Linux model of text processing), and that's why it creates so much controversy."

This informative note would be a good statement to include, directly or edited - e.g. when you start to describe the problems of the BOM. (My hunch is, as well, that the "Linux model of text processing" is ultimately one reason why PHP doesn't handle the BOM so well.)

(II.) Positivity! The page says much about the disadvantages of the BOM. Could you please also describe some advantages of including the BOM? Speaking of the UTF-8 BOM, those advantages are:

a) It is a UTF-8 _signature_ - thus it prevents the page from defaulting to - well - the default encoding.
b) It has effect in both XML/XHTML and HTML.
c) It is small/short.
d) It is very safe: Per Anne's Encoding spec - as implemented in IE (I have not tested the released IE10), WebKit and (as promised by Henri) upcoming versions of Firefox (and since Anne wrote it, I must assume in Opera too) - it is impossible to override, by accident or otherwise, the encoding of pages that include the BOM.
NOTE: Accidental overriding can happen as a side effect of overriding the encoding of the current page, since HTML browsers - to various degrees - remember a manual encoding override also for other pages that you open in the same tab/window.

If you like, you could also add that these advantages are not as important for XML documents, since they ultimately default to UTF-8 anyhow.

(III.) Under the subheading 'Quirks mode in Internet Explorer' (beneath 'Potential issues with the UTF-8 BOM' [2]), please replace 'Internet Explorer 6' with 'Internet Explorer 5.5'. (I verified - again - today, using the fine service at http://netrenderer.de.) (If one follows the link to the article on 'Serving HTML & XHTML', you already make clear there that IE6 is _not_ affected: [3]

"With Internet Explorer 6, however, if anything other than a byte-order mark appears before the DOCTYPE declaration the page is rendered in quirks mode."

You should bring the new BOM article into alignment with that.)

(IV.) Under the subheading 'Transcoding', it is said:

"If you change the encoding of a UTF-8 file from a Unicode encoding to something else, you must ensure that the BOM is removed. If you don't either the browser will continue to treat your content as UTF-8, or you will see strange characters at the beginning of the page."

Remarks: To say "If you change the encoding of a UTF-8 file from a Unicode encoding to something else" sounds strange, for various reasons:

a) It is obvious that a 'UTF-8 file' is using a 'Unicode encoding'.
b) 'non-Unicode encoding' is better than 'something else'.

Suggested reformulation: "If you change the encoding of a Unicode encoded file to a non-Unicode encoding, then …".

(V.) Also, regarding the sentence that goes, quote:

"You should also be aware that, although ASCII is a subset of UTF-8, a file that starts with a BOM is no longer ASCII-compatible."
Here I would propose to change "a file that starts [etc]" to "an otherwise ASCII encoded file that starts with a BOM is no longer ASCII-compatible".

But it is tempting to add that it can also be an ADVANTAGE that the BOM makes the page ASCII-incompatible in this way. Just imagine: A simple BOM, and voila, we are in Unicode land rather than in ISO-8859-1 land. Because ASCII is interpreted as ISO-8859-1 - and friends - on the Web. (Yes, if you declare the page to be ASCII, the browser still interprets it as Latin-1.) Thus, an ASCII encoded page on the Web is, strictly speaking, not ASCII-compatible! From that angle, the page would be more ASCII-compatible if you *added* the BOM. This matters, e.g., if the page accepts input from the user (via a form). Thus, essentially, we are back at the ADVANTAGES of the BOM.

Strictly speaking, if the BOM creates a problem with regard to ASCII-compatibility, then we are at the subject of *transcoding*, which should be a rare and academic exercise! See below.

(VI.) Also, it seems like "Sometimes the encoding of a file is changed ('transcoded')" should be moved to right under the subheading 'Transcoding'.

(VII.) And I think the Transcoding section would do well to recommend against transcoding Unicode/UTF-8 encoded documents. In that connection, you could add that the section on transcoding relates to rare/academic situations.

(VIII.) Btw, the current text also seems to presume that the reader knows that he/she must - in addition to removing the BOM - *also* add a (correct) <meta> charset declaration etc. in its place. I think you should not presume that! You make too much fuss about the problems of the BOM here, I feel …

[1] <http://www.unicode.org/mail-arch/unicode-ml/y2012-m07/0333.html>
[2] <http://www.w3.org/International/questions/new/qa-byte-order-mark-new.en.php#problems>
[3] <http://www.w3.org/International/articles/serving-xhtml/#declaration>

--
leif halvard silli
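
To make points (IV.), (V.) and (VIII.) concrete, here is a minimal Python sketch of the transcoding case. It is only an illustration, not something taken from the article: the function names `has_utf8_bom` and `transcode_to_legacy` and the sample page are hypothetical, and ISO-8859-1 is merely an assumed target encoding. The sketch drops the BOM while transcoding, but it does not touch any in-document <meta> charset declaration, which would still have to be added or corrected separately.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def transcode_to_legacy(data: bytes, target: str = "iso-8859-1") -> bytes:
    """Decode UTF-8 bytes and re-encode them in a non-Unicode encoding.

    The "utf-8-sig" codec strips one leading BOM if it is present, so the
    signature does not end up as stray bytes in the legacy-encoded output.
    """
    text = data.decode("utf-8-sig")
    return text.encode(target)


# Hypothetical sample page: a UTF-8 file that starts with a BOM.
markup = "<!DOCTYPE html><title>blåbær</title>"
page = codecs.BOM_UTF8 + markup.encode("utf-8")

legacy = transcode_to_legacy(page)

# If the BOM bytes were left in place and then read as ISO-8859-1, they
# would show up as the "strange characters" the article warns about.
stray = codecs.BOM_UTF8.decode("iso-8859-1")
```

After this, `has_utf8_bom(page)` is true while `has_utf8_bom(legacy)` is false, and `stray` holds the three characters a reader would see at the top of a page whose BOM survived the transcoding. Trying `page.decode("ascii")` raises `UnicodeDecodeError`, which is the ASCII-incompatibility discussed under (V.).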
Received on Thursday, 22 November 2012 02:34:27 UTC