RE: pre-HTML5 and the BOM from Leif Halvard Silli on 2012-07-18 (www-international@w3.org from July to September 2012)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Thu, 19 Jul 2012 00:32:50 +0200
To: "Phillips, Addison" <addison@lab126.com>
Cc: Martin J. Dürst <duerst@it.aoyama.ac.jp>, "www-international@w3.org" <www-international@w3.org>
Message-ID: <20120719003250351010.16a38ff7@xn--mlform-iua.no>
Phillips, Addison, Wed, 18 Jul 2012 09:56:56 -0700:

> I would also point out that the pages you've cited, in general, 
> continue to mark best practice on the Web as the Internationalization 
> WG understands it.

Noted.

> It is best to use a Unicode character encoding 
> (generally UTF-8).

Of course.

> It is better to use the character encoding 
> directly than it is to use escapes or entities.

Referring to the debate on the Unicode list, I propose that the 
document should say that, starting with HTML5, user agents are 
encouraged to run UTF-8 detection unless a BOM, HTTP header or <meta> 
declares the charset. Hence, if authors place non-ASCII at the top of a 
document - in comments, code or text - then user agents can detect that 
it is 
UTF-8. Chrome implements this already. And Firefox has it for some 
locales, AFAIK. And the point about UTF-8 detection was inserted, I 
think, at the initiative of the I18N group - and it points to Martin's 
'the promise of UTF-8' document.
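
To make the order of checks concrete, here is a minimal Python sketch. 
It is an illustration only, not the HTML5 algorithm itself: the function 
name sniff_encoding, the 1024-byte window, the simplified <meta> regex 
and the windows-1252 fallback are all my own assumptions.

```python
import codecs
import re
from typing import Optional

# Hypothetical, simplified <meta> matcher; real parsers are more careful.
META_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)

def sniff_encoding(body: bytes,
                   http_charset: Optional[str] = None,
                   fallback: str = "windows-1252") -> str:
    """Return an encoding label using the order discussed above."""
    # 1. A UTF-8 BOM settles the question immediately.
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # 2. Otherwise an HTTP charset parameter, if one was sent.
    if http_charset:
        return http_charset.lower()
    # 3. Otherwise a <meta> declaration near the top of the document.
    head = body[:1024]
    match = META_RE.search(head)
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. Otherwise try UTF-8 detection. Pure ASCII gives no evidence,
    #    so only non-ASCII bytes near the top can trigger it.
    if any(byte >= 0x80 for byte in head):
        decoder = codecs.getincrementaldecoder("utf-8")()
        try:
            # final=False tolerates a sequence cut off by the window.
            decoder.decode(head, final=False)
            return "utf-8"
        except UnicodeDecodeError:
            pass
    # 5. Nothing declared, nothing detected: assumed legacy fallback.
    return fallback
```

With this sketch, a page whose first comment contains non-ASCII UTF-8 
text is detected as UTF-8, while an undeclared pure-ASCII page falls 
through to the fallback.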

What better method can there be to encourage authors to type non-ASCII 
than to tell them that there is a benefit and reward if they do so? (Of 
course, they should declare the encoding explicitly - via HTTP, <meta> 
and/or BOM - as well.)

> And it is best to 
> avoid the BOM when one has a choice.

I take the opposite approach: When one has a choice, it is best to 
include it.

It strikes me that, under a heading such as 'What I need to know about 
the BOM',[1] one should be able to find info about the benefits of 
using the UTF-8 BOM. So I'd like to propose that the document should 
list what the benefits are: reliable encoding detection, impossible to 
override manually, extremely short, stable: works in XML, HTML, over 
file:// and when served via http://. 

If you still want to advise against it, you should tell authors - 
either how they can achieve the same without BOM - or why those effects 
are unimportant compared with others.

> We do need to remove 
> misinformation, such as the "three bytes of mojibake garbage" 
> discussion, as this is now obsolete when it comes to browsers.

I don't find that exact quote. Do you refer to this:

]] When the BOM is used in web pages or editors for UTF-8 encoded 
content it can sometimes introduce blank spaces or short sequences of 
strange-looking characters (such as ï»¿). For this reason, it is 
usually best for interoperability to omit the BOM, when given a choice, 
for UTF-8 content.[[

It is indeed obsolete information for browsers. But is it any less 
obsolete for editors? I mean, if an author sees 'ï»¿', then deleting the 
'ï»¿' would be like killing the messenger rather than fixing the 
problem: the editor. The situations when that is not the case must be 
extremely rare.
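
A small Python sketch of what happens in such an editor, assuming the 
editor wrongly reads the file as windows-1252 (the sample word and the 
variable names are mine, for illustration only):

```python
import codecs

# A UTF-8 file that starts with the BOM (bytes EF BB BF).
page = codecs.BOM_UTF8 + "Blåbær".encode("utf-8")

# An editor that assumes windows-1252 shows the BOM as three characters.
wrong = page.decode("cp1252")
assert wrong[:3] == "ï»¿"

# Deleting those three characters does not repair the rest of the text:
# the editor is still decoding with the wrong encoding.
assert wrong[3:] == "BlÃ¥bÃ¦r"

# Decoding as UTF-8 (with BOM handling) gives the intended text.
right = page.decode("utf-8-sig")
assert right == "Blåbær"
```

The visible 'ï»¿' is thus a symptom of the wrong decoding, and the 
remaining non-ASCII text stays garbled after it has been deleted.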

The document does not discuss PHP, as far as I can see. But elsewhere, 
the I18N group has discussed PHP and BOM.[2] I feel that that article 
mixes production problems and serving problems. To avoid production 
problems, one may want to remove the BOM, of course. But subsequently, 
the CMS could insert the BOM! E.g. in the PHP-based CMS I use for my 
blog, I have my CMS insert the BOM for me. This creates no problems for 
PHP of any version, AFAICT.
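
The separation I have in mind can be sketched like this, in Python 
rather than PHP and not taken from any actual CMS; the function names 
strip_bom and serve_with_bom are hypothetical:

```python
import codecs

def strip_bom(data: bytes) -> bytes:
    """Production side: remove a leading UTF-8 BOM from a source file."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

def serve_with_bom(*fragments: bytes) -> bytes:
    """Serving side: join BOM-free fragments, then add exactly one BOM."""
    body = b"".join(strip_bom(fragment) for fragment in fragments)
    return codecs.BOM_UTF8 + body
```

For example, serve_with_bom(header, content, footer) yields one BOM at 
the very start of the response, regardless of whether the individual 
template files were saved with or without one.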
 
> Regarding:
> 
>>> [2]
>>> http://www.w3.org/International/questions/qa-byte-order-mark#bomhow

> 
> The WG discussed this document in our teleconference today, as it 
> happens [1], and work is already underway to update this page. 
> However, the WG still seems to feel that the Byte Order Mark is 
> better to avoid when possible, even if it is not the barrier to 
> display or interoperability that it once was. 

Then why?

> I do note that BOM and NCR/entities are (or at least should be) 
> separate considerations.

Why and how? In the sense that the BOM is an (invisible) encoding 
declaration/signature, then they are of course different considerations 
from that angle: If you add a BOM, then you have declared the encoding. 
But if you avoid character escaping then you have not. Even if you have 
increased the chance that the encoding is detected.

> Using a BOM as an encoding signature and 
> then escaping it is an absurd thing to do.

To escape the BOM? Yes, that would be absurd. Did you intend to say 
something else here? Or was that some kind of strawman?

> FWIW, I also agree with  Martin's comment:
> 
>>> I'm not sure there are many people for whom using named character 
>>> entities or numeric character references is a convenience. But for 
>>> those for whom it is a convenience, let them use it.

I am uncertain about what that comment means. In what context is it 
relevant? Does the I18N Group have any QA document where we may see the 
consequences of that comment reflected? Currently I only see the 
opposite advice - which you repeated above: That one should type 
characters directly.

> [1] http://www.w3.org/2012/07/18-i18n-minutes.html


[1] http://www.w3.org/International/questions/qa-byte-order-mark#bomhow

[2] http://www.w3.org/International/questions/qa-utf8-bom.en

-- 
Leif Halvard Silli
Received on Wednesday, 18 July 2012 22:33:39 UTC