Re: byte order mark article

Anne van Kesteren, Wed, 21 Nov 2012 22:04:22 +0100:
> I saw http://www.w3.org/International/questions/new/qa-byte-order-mark-new
> in the minutes.

I have no objections to Anne’s comments. It is especially important 
that the BOM overrides everything else. But instead of removing the 
warnings, perhaps you could say that, as of today, not all HTML UAs 
let the BOM override the HTTP charset yet. Also, of course, one should 
not encourage anyone to make the BOM and HTTP disagree!

Here are some comments of my own:

(I.) While I often speak well of the BOM, I read a good, critical 
comment from Martin on the Unicode mailing list this summer: [1] 

  "The problem with the BOM in UTF-8 is that it can be quite
   helpful (for quickly distinguishing between UTF-8 and
   legacy-encoded files) and quite damaging (for programs that use
   the Unix/Linux model of text processing), and that's why it 
   creates so much controversy."

   This informative note would be a good statement to include, directly 
or edited - e.g. when you start to describe the problems of the BOM. 
(My hunch is, as well, that the "Unix/Linux model of text processing" 
is ultimately one reason why PHP doesn't handle the BOM so well.) 
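
To illustrate both halves of Martin's point, here is a minimal Python 
sketch. The function name and the sample bytes are my own, purely for 
illustration; the shebang check merely stands in for any byte-oriented 
Unix-style tool that looks at the very first bytes of a file.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# Hypothetical sample files: the same shell script, with and without a BOM.
without_bom = b"#!/bin/sh\necho hi\n"
with_bom = codecs.BOM_UTF8 + without_bom

# Helpful: the signature marks the file as UTF-8 at a glance.
print(has_utf8_bom(with_bom))      # True
print(has_utf8_bom(without_bom))   # False

# Damaging: a tool that expects "#!" as the first two bytes
# no longer finds it once the BOM is in front.
print(without_bom.startswith(b"#!"))  # True
print(with_bom.startswith(b"#!"))     # False
```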

(II.)  Positivity! The page says much about the disadvantages of the 
BOM. Could you please also describe some advantages of including the 
BOM? For the UTF-8 BOM, those advantages are:

 a) It is a UTF-8 _signature_ - thus it prevents the page from
       defaulting to - well - the default encoding.
 b) It has effect in both XML/XHTML and HTML.
 c) It is small/short.
 d) It is very safe: Per Anne's Encoding spec - and as implemented 
in IE (I have not tested the released IE10), WebKit and (as promised by 
Henri) upcoming versions of Firefox (and since Anne wrote it, I must 
assume in Opera too) - it is impossible to override, by accident or 
otherwise, the encoding of pages that include the BOM. NOTE: Accidental 
overriding can happen as a side effect of overriding the encoding of 
the current page, since HTML browsers - to varying degrees - remember a 
manual encoding override for other pages that you open in the same 
tab/window as well. If you like, you could also add that these 
advantages are not as important for XML documents, since they 
ultimately default to UTF-8 anyhow.
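
The precedence described in d) could be sketched roughly as follows. 
This is a simplification under my own assumptions, not the algorithm 
from the Encoding spec: the function and parameter names are 
hypothetical, the order of the non-BOM sources is my reading of how 
browsers behave, and the "windows-1252" fallback merely stands in for 
whatever the locale default happens to be.

```python
import codecs
from typing import Optional


def choose_encoding(
    data: bytes,
    user_override: Optional[str] = None,   # manual encoding menu choice
    http_charset: Optional[str] = None,    # charset from Content-Type
    meta_charset: Optional[str] = None,    # charset from <meta>
    default: str = "windows-1252",         # assumed locale default
) -> str:
    """Pick an encoding, letting a BOM win over every other source."""
    # A BOM beats everything, including a manual override.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16le"
    # Without a BOM, fall back through the remaining sources in order.
    return user_override or http_charset or meta_charset or default


page = codecs.BOM_UTF8 + b"<!DOCTYPE html><title>x</title>"
print(choose_encoding(page, user_override="iso-8859-1"))      # utf-8
print(choose_encoding(page[3:], user_override="iso-8859-1"))  # iso-8859-1
```

Note how the remembered manual override only takes effect on the page 
without the BOM.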

(III.) Under the subheading 'Quirks mode in Internet Explorer' (beneath 
'Potential issues with the UTF-8 BOM'[2]), please replace 'Internet 
Explorer 6' with 'Internet Explorer 5.5'. (I verified this - again - 
today, using the fine service at http://netrenderer.de.) (If one 
follows the link to the article on 'Serving HTML & XHTML', you already 
make it clear there that IE6 is _not_ affected:[3] "With Internet 
Explorer 6, however, if anything other than a byte-order mark appears 
before the DOCTYPE declaration the page is rendered in quirks mode." 
You should bring the new BOM article into alignment with that.)

(IV.) Under the subheading 'Transcoding', it is said:

     "If you change the encoding of a UTF-8 file from a Unicode encoding
      to something else, you must ensure that the BOM is removed.

      If you don't either the browser will continue to treat your content
      as UTF-8, or you will see strange characters at the beginning of
      the page."

   Remarks. To say "If you change the encoding of a UTF-8 file from a
       Unicode encoding to something else" sounds strange, for
       two reasons:
    a) It is obvious that a 'UTF-8 file' is using a 'Unicode encoding'.
    b) 'non-Unicode encoding' is better than 'something else'.
       Suggested reformulation: "If you change the encoding of a
       Unicode encoded file to a non-Unicode encoding, then …".
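
For what it is worth, the Transcoding advice quoted above can be 
sketched in a few lines of Python. The function name and the sample 
text are my own assumptions; "utf-8-sig" is the standard Python codec 
that drops a leading BOM while decoding.

```python
import codecs


def transcode_utf8_to_latin1(data: bytes) -> bytes:
    """Transcode UTF-8 bytes to ISO-8859-1, removing a leading BOM."""
    # "utf-8-sig" decodes UTF-8 and strips the BOM if one is present.
    text = data.decode("utf-8-sig")
    return text.encode("iso-8859-1")


source = codecs.BOM_UTF8 + "café".encode("utf-8")
print(transcode_utf8_to_latin1(source))  # b'caf\xe9'

# If the three BOM bytes were left in place and read as ISO-8859-1,
# they would show up as the "strange characters" the article mentions.
print(codecs.BOM_UTF8.decode("iso-8859-1"))  # ï»¿
```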

(V.)  Also, regarding the sentence that goes, quote: "You should also 
be aware that, although ASCII is a subset of UTF-8, a file that starts 
with a BOM is no longer ASCII-compatible." Here I would propose to 
change "a file that starts [etc]" to "an otherwise ASCII encoded file 
that starts with a BOM is no longer ASCII-compatible".

       But it is tempting to add that it can also be an ADVANTAGE that 
the BOM makes the page ASCII-incompatible in this way. Just imagine: A 
simple BOM, and voilà, we are in Unicode land rather than in ISO-8859-1 
land. Because ASCII is interpreted as ISO-8859-1 - and friends - on the 
Web. (Yes, if you declare the page to be ASCII, the browser still 
interprets it as Latin-1.) Thus, an ASCII encoded page on the Web is, 
strictly speaking, not ASCII-compatible! From that angle, the page 
would actually be more ASCII-compatible if you *added* the BOM. This 
matters e.g. if the page accepts input from the user (via a form). 
Thus, essentially, we are back at the ADVANTAGES of the BOM. Strictly 
speaking, if the BOM creates a problem with regard to 
ASCII-compatibility, then we are at the subject of *transcoding*, which 
should be a rare and academic exercise! See below.
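
The sentence under discussion in (V.) is easy to check with a small 
Python sketch. The helper name and the sample markup are hypothetical; 
the strict "ascii" codec here stands in for any consumer that accepts 
only 7-bit bytes.

```python
import codecs


def decodes_as_ascii(data: bytes) -> bool:
    """Return True if every byte is valid under a strict ASCII decoder."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


page = b"<p>Hello, world</p>"
page_with_bom = codecs.BOM_UTF8 + page

print(decodes_as_ascii(page))           # True
print(decodes_as_ascii(page_with_bom))  # False

# Read as UTF-8, however, both files carry exactly the same text.
print(page_with_bom.decode("utf-8-sig") == page.decode("ascii"))  # True
```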

(VI.) Also, it seems that "Sometimes the encoding of a file is changed 
('transcoded')" should be moved to directly under the subheading 
'Transcoding'.

(VII.) And I think the Transcoding section would do well to 
recommend against transcoding Unicode/UTF-8 encoded documents. In 
that connection, you could add that the section on transcoding 
relates to rare/academic situations.

(VIII.) Btw, the current text also seems to presume that the reader 
knows that he/she must - in addition to removing the BOM - *also* 
replace the BOM with a (correct) <meta> charset declaration etc. I 
think you should not presume that! You make too much fuss about the 
problems of the BOM here, I feel …

[1] <http://www.unicode.org/mail-arch/unicode-ml/y2012-m07/0333.html>
[2] 
<http://www.w3.org/International/questions/new/qa-byte-order-mark-new.en.php#problems>
[3] 
<http://www.w3.org/International/articles/serving-xhtml/#declaration>
-- 
leif halvard silli

Received on Thursday, 22 November 2012 02:34:27 UTC