
Re: Feedback about the BOM article

From: Andrew Cunningham <andrewc@vicnet.net.au>
Date: Wed, 19 Dec 2012 01:41:08 +0700
Message-ID: <CAGJ7U-WO_yDsDYYqVHLxKJsKMUUFE8jjoULYs7ObMJSaGP+aPA@mail.gmail.com>
To: Richard Ishida <ishida@w3.org>
Cc: Henri Sivonen <hsivonen@iki.fi>, www-international@w3.org

I suppose I have got to the stage where I find the whole discussion of UTF-8
and the BOM amusing.

And no one really seems to be addressing the elephant in the room.

For UTF-8 the BOM has its uses, but it is pointless if the developers know
what they're doing.

I have used UTF-8 exclusively since the late 1990s. The tools I use can
handle a UTF-8 file with or without a BOM perfectly well. The editing tools
I use indicate the encoding and the presence or absence of a BOM.

Tools that require a BOM also tend to be tools that can't handle the
languages I work with.

I usually find the problem is poor programming and development, not whether
a BOM is useful or not.

OK, many of you will disagree.

But I guess that the game is changing ...

But I wish the discussion would turn to HTML5 and how to handle content
when Unicode is not sufficient for the task.

Whether that is for scripts that are not in Unicode yet. And I can think of
two in that category that I am working with at the moment.

Or when text layout and font rendering in browsers are not up to the task
of rendering various scripts in Unicode.

Or the resurgence of pseudo-Unicode encodings, with no way to distinguish
between Unicode content and pseudo-Unicode content.

Especially when mobile devices are becoming a more prevalent way of
accessing web content in developing countries, and will outstrip laptop and
desktop access, while font rendering and font support on mobile platforms
are so primitive.

There are huge encoding issues out there. And the BOM, and its presence or
absence, isn't a major issue. Just a distraction from more important
issues.

Andrew
On 19/12/2012 12:57 AM, "Richard Ishida" <ishida@w3.org> wrote:

> Henri, thanks for the feedback. I should start out by saying that,
> although I made some changes recently, the version of the article that
> people have been providing feedback on was a very early draft where I threw
> in a bunch of ideas for consideration including some that were known to be
> 'at risk'. The draft leaked out through an initial discussion in the
> minutes, so a number of your comments hit areas that I was happy to abandon.
>
> More comments below...
>
> On 10/12/2012 16:16, Henri Sivonen wrote:
>
>> Feedback about
>> http://www.w3.org/International/questions/new/qa-byte-order-mark-new
>>
>> - -
>>
>> The “What is” intro should probably mention UTF-16 in order to explain
>> where the “byte order” part of the name comes from. However, I think
>> it would be the best to frame this as etymology arising from a legacy
>> encoding. That is, I think it should be made clear from the start that
>> UTF-16 should not be used for interchange. Something like:
>>
>> “Before UTF-8 was introduced in early 1993, the expected way for
>> transferring Unicode text was using 16-bit code units using an
>> encoding called UCS-2 which was later extended to UTF-16. 16-bit code
>> units can be expressed as bytes in two ways: the most significant byte
>> first (big endian) or the least significant byte first (little
>> endian). To communicate which byte order was in use, the stream was
>> started by writing U+FEFF (the code point for ZERO WIDTH NO-BREAK
>> SPACE) at the start of the stream as a magic number that is not
>> logically part of the text the stream represents.
>>
>> Even though UTF-8 proved to be a superior way of interchanging Unicode
>> text and UTF-8 didn't pose the issue of alternative byte orders,
>> U+FEFF can still be encoded as UTF-8 (resulting in the bytes 0xEF,
>> 0xBB, 0xBF) at the start of the stream in order to give UTF-8 a
>> recognizable magic number (encoding signature).”
>>
>
> I've been thinking for a while of doing just what you suggest, so I used
> some of your text. Thanks!
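
The byte-order and signature mechanics described in the suggested intro can
be sketched in a few lines. This is an editorial illustration, not part of
the article or the thread; the helper name `sniff_bom` is hypothetical:

```python
# Illustration only: recognizing the BOMs discussed above at the
# start of a byte stream.

def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None."""
    if data.startswith(b"\xef\xbb\xbf"):   # U+FEFF encoded as UTF-8
        return "utf-8"
    if data.startswith(b"\xfe\xff"):       # most significant byte first (big endian)
        return "utf-16-be"
    if data.startswith(b"\xff\xfe"):       # least significant byte first (little endian)
        return "utf-16-le"
    return None

print(sniff_bom("text".encode("utf-8-sig")))  # utf-8
print(sniff_bom(b"text"))                     # None
```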
>
>
>> - -
>>
>> “since it is impossible to override manually”
>>
>> This is currently untrue in Firefox and Opera at least.
>>
>>
> Yes. Deleted.
>
>> - -
>>
>> “However, bear in mind that it is always a good idea to declare the
>> encoding of your page using the meta element, in addition to the BOM,
>> so that the encoding is apparent to people visually inspecting the
>> file. ”
>>
>> I disagree: Either the <meta> declaration is redundant or it is wrong
>> and misleads a person who is inspecting the file.
>>
>
> I agree that the meta is redundant for machines, but these articles are
> aimed at ordinary people, a huge majority of whom are pretty clueless about
> character encodings and don't have any kind of x-ray vision to be able to
> spot invisible things like the BOM. There are a number of situations where
> it can be useful to mortals to have a visual identification of the encoding
> used.
>
> For example, just today I viewed the source text of a page encoded in
> UTF-8 and using only the BOM to indicate the character encoding, and I
> copied the whole of the page to a blank file in my text editor and saved
> it. The BOM wasn't copied with it - so I no longer had any encoding
> information for that file. When I looked at it with the latest version of a
> non-mainstream browser, it was rendered as Latin1 with mojibake. If there
> had been a visible encoding declaration as well as the BOM, it would have
> displayed correctly after copying and pasting.
>
> I agree that in an ideal world all of this should just work, but it
> doesn't yet, and visible encoding labels can help keep things working in
> the meantime.
>
> We don't require it. We just recommend it.
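
For concreteness, the visible declaration being recommended here is the
HTML5 form of the meta element. A minimal sketch (not taken from the
article):

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <!-- Visible to anyone inspecting the source, unlike the BOM -->
  <meta charset="utf-8">
  <title>Example page</title>
</head>
<body>
  <p>Example content</p>
</body>
</html>
```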
>
>
>> - -
>>
>> “If you change the encoding of a UTF-8 file from a Unicode encoding to
>> a non-Unicode encoding, you must ensure that the BOM is removed.”
>>
>> This should remark that you should never want to change the encoding
>> away from UTF-8, so this is a non-issue in that sense. :-)
>>
>
> I agree. I reworked the text.
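
The point about removing the BOM when converting away from UTF-8 can be
illustrated briefly. This is an editorial sketch, not the article's text;
`transcode` is a hypothetical helper:

```python
# Illustration only: converting UTF-8 text to a non-Unicode encoding.
# Python's 'utf-8-sig' codec strips a leading BOM if one is present;
# leaving the BOM in would fail, since U+FEFF has no Latin-1 form.

def transcode(data, target="latin-1"):
    text = data.decode("utf-8-sig")  # drops a leading U+FEFF, if any
    return text.encode(target)

print(transcode(b"\xef\xbb\xbfcaf\xc3\xa9"))  # b'caf\xe9'
```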
>
>
>
>> - -
>>
>> “If a page is originally in a Unicode encoding and the transcoder
>> switches the encoding to something else, such as Latin1, it will
>> usually indicate the new encoding by changing the information in the
>> HTTP header. The transcoder will typically not remove the byte-order
>> mark.”
>>
>> [citation needed]
>>
>
> We had already been discussing this in the WG. The original text was based
> on some discussions and assumptions from a long time back. It's now removed.
>
>
>> - -
>>
>> “In Internet Explorer 5.5 a BOM at the start of a file will cause the
>> page to be rendered in quirks mode”
>>
>> IE 5.5 only had the quirks mode. The first IE for Windows that
>> introduced a non-quirks mode was IE6. And in any case, it’s silly to
>> give advice about IE 5.5 in this day and age.
>>
>> As for interference in later IE, [citation needed].
>>
>
> This was a knee-jerk temporary edit to a comment from Leif, pending
> further thought. That subsection has now been removed.
>
>
>> - -
>>
>> “A UTF-8 signature at the beginning of a CSS file can sometimes cause
>> the initial rules in the file to fail on certain user agents.”
>>
>> [citation-needed]
>>
>
> Removed this section.
>
>
>> - -
>>
>> “The use of UTF-32 for HTML content, however, is strongly discouraged
>> and some implementations are removing support for it, so we haven't
>> even mentioned it until now.”
>>
>> Have removed support, rather.
>>
>
> Done.
>
>
> RI
>
>>
>>
>
Received on Tuesday, 18 December 2012 18:41:37 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Tuesday, 18 December 2012 18:41:38 GMT