Re: byte order mark article from Leif Halvard Silli on 2012-11-22 (www-international@w3.org from October to December 2012)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Thu, 22 Nov 2012 03:59:24 +0100
To: www-international@w3.org
Message-ID: <20121122035924706708.212ef6bc@xn--mlform-iua.no>
As an additional point to (II.), I would propose that you change the 
title "Removing the BOM" to "Adding or removing the BOM" and rewrite 
that section accordingly.
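
To illustrate what such a section could cover, here is a minimal Python 
sketch of both operations on raw bytes. The function names are my own, 
hypothetical ones; the only library facility assumed is the standard 
codecs.BOM_UTF8 constant (the bytes EF BB BF).

```python
import codecs


def add_utf8_bom(data: bytes) -> bytes:
    """Prepend the UTF-8 BOM unless the data already starts with it."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


def remove_utf8_bom(data: bytes) -> bytes:
    """Strip one leading UTF-8 BOM, if present; otherwise return data as-is."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Both helpers are idempotent, so running them twice on the same file 
does no harm.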

Leif H Silli

Leif Halvard Silli, Thu, 22 Nov 2012 03:33:57 +0100:
> Anne van Kesteren, Wed, 21 Nov 2012 22:04:22 +0100:
>> I saw http://www.w3.org/International/questions/new/qa-byte-order-mark-new
>> in the minutes.
> 
> I have no objections to Anne’s comments. That the BOM overrides 
> anything else is especially important. But instead of removing the 
> warnings, perhaps you could say that, as of today, not all HTML UAs 
> yet let the BOM override the HTTP charset. Also, of course, one should 
> not encourage anyone to make the BOM and HTTP disagree!
> 
>     Here are some comments of my own:
> 
> (I.) While I often speak well of the BOM, I heard a good, critical 
> comment from Martin, on the Unicode mailing list this summer: [1] 
> 
>   "The problem with the BOM in UTF-8 is that it can be quite
>    helpful (for quickly distinguishing between UTF-8 and
>    legacy-encoded files) and quite damaging (for programs that use
>    the Unix/Linux model of text processing), and that's why it 
>    creates so much controversy."
> 
>    This informative note would be a good statement to include, directly 
> or edited - e.g. when you start to describe the problems of the BOM. 
> (My hunch is, as well, that the "Linux model of text processing" is 
> ultimately one reason why PHP doesn't handle the BOM so well.) 
> 
> (II.)  Positivity! The page says much about the disadvantages of the 
> BOM. Could you please also describe some advantages of including the 
> BOM? For the UTF-8 BOM, those advantages are:
> 
>  a) It is a UTF-8 _signature_ - thus it prevents the page from
>        defaulting to - well - the default encoding.
>  b) It has effect in both XML/XHTML and HTML.
>  c) It is small/short.
>  d) It is very safe: Per Anne's Encoding spec - as well as implemented 
> in IE (I have not tested released IE10), WebKit and (as promised by 
> Henri) upcoming versions of Firefox (and since Anne wrote it, I must 
> assume in Opera too) - it is impossible, by accident or otherwise, to 
> override the encoding of pages that include the BOM. NOTE: Accidental 
> overriding can happen as a side effect of overriding the current page, 
> since HTML browsers - to varying degrees - remember a manual encoding 
> override also for other pages that you open in the same Tab/Window. 
> If you like, you could also add that these advantages are not as 
> important for XML documents, since they ultimately default to UTF-8 
> anyhow.
> 
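
The "signature" and "overrides anything else" points above can be 
sketched roughly as follows. This is only an illustration of the 
precedence order, not the Encoding spec's actual algorithm: the function 
names and the windows-1252 fallback are assumptions of mine, and real 
browsers also consult <meta> declarations and other hints that are left 
out here.

```python
import codecs
from typing import Optional

# BOM byte sequences and the encoding each one signals.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16le"),  # FF FE
)


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding signalled by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def choose_encoding(data: bytes,
                    http_charset: Optional[str] = None,
                    fallback: str = "windows-1252") -> str:
    """Pick an encoding: the BOM wins, then the HTTP charset, then the fallback."""
    return sniff_bom(data) or http_charset or fallback
```

With this ordering, a page that starts with EF BB BF is treated as 
UTF-8 even if the HTTP header says iso-8859-1, while a page without a 
BOM falls back to the header or the default.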
> (III.) Under the subheading 'Quirks mode in Internet Explorer' (beneath 
> 'Potential issues with the UTF-8 BOM'[2]), please replace 'Internet 
> Explorer 6' with 'Internet Explorer 5.5'. (I verified this - again - 
> today, using the fine service at http://netrenderer.de.) (If one 
> follows the link to the article on 'Serving HTML & XHTML', you already 
> make clear there that IE6 is _not_ affected:[3] "With Internet Explorer 
> 6, however, if anything other than a byte-order mark appears before the 
> DOCTYPE declaration the page is rendered in quirks mode." You should 
> bring the new BOM article into alignment with that.)
> 
> (IV.) Under the subheading 'Transcoding', it is said:
> 
>      "If you change the encoding of a UTF-8 file from a Unicode encoding
>       to something else, you must ensure that the BOM is removed.
> 
>       If you don't either the browser will continue to treat your
>       content as UTF-8, or you will see strange characters at the
>       beginning of the page."
> 
>    Remarks. To say "If you change the encoding of a UTF-8 file from a
>        Unicode encoding to something else", sounds strange, for
>        various reasons:
>     a) It is obvious that a 'UTF-8 file' is using a 'Unicode encoding'.
>     b) 'non-Unicode encoding' is better than 'something else'.
>        Suggested reformulation: "If you change the encoding of a
>        Unicode encoded file to a non-Unicode encoding, then …".
> 
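
A small Python sketch of the transcoding case discussed above, 
assuming the source bytes are UTF-8 and the target is a non-Unicode 
encoding such as ISO-8859-1. The function name is hypothetical; 
"utf-8-sig" is the standard Python codec that drops a leading BOM on 
decoding.

```python
def transcode_from_utf8(data: bytes, target: str) -> bytes:
    """Decode UTF-8 bytes, dropping a leading BOM if present, and re-encode.

    Raises UnicodeEncodeError if the text contains characters that the
    target encoding cannot represent.
    """
    text = data.decode("utf-8-sig")  # "utf-8-sig" strips the BOM when there is one
    return text.encode(target)


# What a leftover BOM looks like when its bytes are read as ISO-8859-1:
# these are the "strange characters at the beginning of the page".
leftover = b"\xef\xbb\xbf".decode("iso-8859-1")
```

Note that the sketch only handles the bytes. Any encoding declaration 
in the page itself still has to be updated separately.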
> (V.)  Also, regarding the sentence that goes, quote: "You should also 
> be aware that, although ASCII is a subset of UTF-8, a file that starts 
> with a BOM is no longer ASCII-compatible." Here I would propose to 
> replace "a file that starts [etc]" with "an otherwise ASCII encoded 
> file that starts with a BOM is no longer ASCII-compatible".
> 
>        But it is tempting to add that it can also be an ADVANTAGE that 
> the BOM makes the page ASCII-incompatible in this way. Just imagine: A 
> simple BOM, and voila, we are in Unicode land rather than in ISO-8859-1 
> land. Because ASCII is interpreted as ISO-8859-1 - and friends - on the 
> Web. (Yes, if you declare the page to be ASCII, the browser still 
> interprets it as Latin-1.) Thus, an ASCII encoded page on the Web is, 
> strictly speaking, not ASCII-compatible! From that angle, the page 
> would be more ASCII-compatible if you *added* the BOM. This matters, 
> e.g., if the page accepts input from the user (via a form). Thus, 
> essentially, we are back at the ADVANTAGES of the BOM. Strictly 
> speaking, if the BOM creates a problem with regard to 
> ASCII-compatibility, then we are at the subject of *transcoding*, which 
> should be a rare and academic exercise! See below.
> 
> (VI.) Also, it seems like "Sometimes the encoding of a file is changed 
> ('transcoded')" should be moved to right under the subheading 
> 'Transcoding'.
> 
> (VII.) And I think the Transcoding section would do well to recommend 
> against transcoding Unicode/UTF-8 encoded documents. And thus, in that 
> connection, you could add that the section on transcoding relates to 
> rare/academic situations.
> 
> (VIII.) Btw, the current text also seems to assume that the reader 
> knows that he/she must - in addition to removing the BOM - *also* put 
> a (correct) <meta> charset declaration etc. in its place. I think you 
> should not assume that! You make too much fuss about the problems of 
> the BOM here, I feel …
> 
> [1] <http://www.unicode.org/mail-arch/unicode-ml/y2012-m07/0333.html>
> [2] <http://www.w3.org/International/questions/new/qa-byte-order-mark-new.en.php#problems>
> [3] <http://www.w3.org/International/articles/serving-xhtml/#declaration>
> -- 
> leif halvard silli
Received on Thursday, 22 November 2012 02:59:53 UTC