
Re: Feedback about the BOM article

From: Richard Ishida <ishida@w3.org>
Date: Tue, 18 Dec 2012 17:54:31 +0000
Message-ID: <50D0ADD7.9030805@w3.org>
To: Henri Sivonen <hsivonen@iki.fi>
CC: www-international@w3.org
Henri, thanks for the feedback. I should start out by saying that, 
although I made some changes recently, the version of the article that 
people have been providing feedback on was a very early draft where I 
threw in a bunch of ideas for consideration, including some that were 
known to be 'at risk'. The draft leaked out through an initial discussion 
in the minutes, so a number of your comments hit areas that I was happy 
to abandon.

More comments below...

On 10/12/2012 16:16, Henri Sivonen wrote:
> Feedback about http://www.w3.org/International/questions/new/qa-byte-order-mark-new
>
> - -
>
> The “What is” intro should probably mention UTF-16 in order to explain
> where the “byte order” part of the name comes from. However, I think
> it would be the best to frame this as etymology arising from a legacy
> encoding. That is, I think it should be made clear from the start that
> UTF-16 should not be used for interchange. Something like:
>
> “Before UTF-8 was introduced in early 1993, the expected way for
> transferring Unicode text was using 16-bit code units using an
> encoding called UCS-2 which was later extended to UTF-16. 16-bit code
> units can be expressed as bytes in two ways: the most significant byte
> first (big endian) or the least significant byte first (little
> endian). To communicate which byte order was in use, the stream
> began with U+FEFF (the code point for ZERO WIDTH NO-BREAK
> SPACE) as a magic number that is not
> logically part of the text the stream represents.
>
> Even though UTF-8 proved to be a superior way of interchanging Unicode
> text and UTF-8 didn't pose the issue of alternative byte orders,
> U+FEFF can still be encoded as UTF-8 (resulting in the bytes 0xEF,
> 0xBB, 0xBF) at the start of the stream in order to give UTF-8 a
> recognizable magic number (encoding signature).”

I've been thinking for a while of doing just what you suggest, so I used 
some of your text. Thanks!
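For anyone curious about the mechanics behind that paragraph, here is a minimal sketch of signature sniffing. The byte sequences are the standard BOM encodings; the function name and the fallback behaviour are just illustrative, not anything from the article.

```python
def sniff_bom(data: bytes):
    """Return the encoding implied by a leading byte-order mark, if any."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"       # U+FEFF encoded as UTF-8 (the "signature")
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"   # least significant byte first
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"   # most significant byte first
    return None              # no BOM: rely on other encoding metadata

print(sniff_bom("text".encode("utf-8-sig")))  # utf-8
```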

>
> - -
>
> “since it is impossible to override manually”
>
> This is currently untrue in Firefox and Opera at least.
>

Yes. Deleted.

> - -
>
> “However, bear in mind that it is always a good idea to declare the
> encoding of your page using the meta element, in addition to the BOM,
> so that the encoding is apparent to people visually inspecting the
> file. ”
>
> I disagree: Either the <meta> declaration is redundant or it is wrong
> and misleads a person who is inspecting the file.

I agree that the meta is redundant for machines, but these articles are 
aimed at ordinary people, a huge majority of whom are pretty clueless 
about character encodings and don't have any kind of x-ray vision to be 
able to spot invisible things like the BOM. There are a number of 
situations where it can be useful to mortals to have a visual 
identification of the encoding used.

For example, just today I viewed the source text of a page encoded in 
UTF-8 and using only the BOM to indicate the character encoding, and I 
copied the whole of the page to a blank file in my text editor and saved 
it. The BOM wasn't copied with it - so I no longer had any encoding 
information for that file. When I looked at it with the latest version 
of a non-mainstream browser, it was rendered as Latin1 with mojibake. If 
there had been a visible encoding declaration as well as the BOM, it 
would have displayed correctly after copying and pasting.

I agree that in an ideal world all of this should just work, but it 
doesn't yet, and visible encoding labels can help keep things working in 
the meantime.

We don't require it. We just recommend it.
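Incidentally, the copy-and-paste loss is easy to reproduce: most tools consume the BOM while decoding, so it never reaches the clipboard. A small illustration (the codec names are Python's; the scenario is the one from my anecdote, not a claim about any particular editor):

```python
# A UTF-8 file that relies on the BOM alone for encoding information.
raw = b"\xef\xbb\xbfHello"

# Decoding with 'utf-8-sig' silently strips the BOM, which is
# effectively what happens before the text is copied.
text = raw.decode("utf-8-sig")
print(text)      # Hello

# Re-saving the pasted text yields bytes with no BOM, so the new
# file carries no encoding information at all.
resaved = text.encode("utf-8")
print(resaved)   # b'Hello'
```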

>
> - -
>
> “If you change the encoding of a UTF-8 file from a Unicode encoding to
> a non-Unicode encoding, you must ensure that the BOM is removed.”
>
> This should remark that you should never want to change the encoding
> away from UTF-8, so this is a non-issue in that sense. :-)

I agree. I reworked the text.
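For completeness, if someone does transcode away from UTF-8, dropping the signature correctly is a one-step affair. A sketch, with 'latin-1' standing in for whatever legacy target is chosen:

```python
data = b"\xef\xbb\xbfcaf\xc3\xa9"   # UTF-8 file starting with a BOM

# Decode with 'utf-8-sig' so the BOM is consumed rather than
# becoming a stray U+FEFF in the transcoded output.
text = data.decode("utf-8-sig")
latin1 = text.encode("latin-1")
print(latin1)    # b'caf\xe9'
```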


>
> - -
>
> “If a page is originally in a Unicode encoding and the transcoder
> switches the encoding to something else, such as Latin1, it will
> usually indicate the new encoding by changing the information in the
> HTTP header. The transcoder will typically not remove the byte-order
> mark.”
>
> [citation needed]

We had already been discussing this in the WG. The original text was 
based on some discussions and assumptions from a long time back. It's 
now removed.

>
> - -
>
> “In Internet Explorer 5.5 a BOM at the start of a file will cause the
> page to be rendered in quirks mode”
>
> IE 5.5 only had the quirks mode. The first IE for Windows that
> introduced a non-quirks mode was IE6. And in any case, it’s silly to
> give advice about IE 5.5 in this day and age.
>
> As for interference in later IE, [citation needed].

This was a knee-jerk temporary edit to a comment from Leif, pending 
further thought. That subsection has now been removed.

>
> - -
>
> “A UTF-8 signature at the beginning of a CSS file can sometimes cause
> the initial rules in the file to fail on certain user agents.”
>
> [citation needed]

Removed this section.

>
> - -
>
> “The use of UTF-32 for HTML content, however, is strongly discouraged
> and some implementations are removing support for it, so we haven't
> even mentioned it until now.”
>
> Have removed support, rather.

Done.


RI
Received on Tuesday, 18 December 2012 17:55:15 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Tuesday, 18 December 2012 17:55:16 GMT