[Bug 13392] i18n-ISSUE-72: BOM as preferred encoding declaration from bugzilla@jessica.w3.org on 2011-08-01 (public-i18n-core@w3.org from July to September 2011)

From: <bugzilla@jessica.w3.org>
Date: Mon, 01 Aug 2011 14:03:13 +0000
To: public-i18n-core@w3.org
Message-Id: <E1Qnt5N-0007ep-BU@jessica.w3.org>
http://www.w3.org/Bugs/Public/show_bug.cgi?id=13392

--- Comment #12 from Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no> 2011-08-01 14:03:11 UTC ---
COMPROMISE PROPOSAL:

* The text "(preferred)" was Eliot's addition which, however, was quite
compatible with the arguments I presented along with my original spec text
proposal - as such I endorsed it/did not speak against it.
* But I am personally happy to state the facts and let the authors draw the
conclusions themselves. As such, I can see that the current text - with its
"(preferred)" - states a preference without proper justification within the
spec.

Hence, instead of the I18N Group's proposed change, I would like to suggest the
following, which helps the reader's understanding more:

 1)    REPLACE: 
"By using the Byte Order Mark (BOM) character (preferred)."
    WITH:
"By using the Byte Order Mark (BOM) character, which is an encoding
 signature that both XML and HTML parsers are required to support."

<!--NOTE: the phrase 'encoding signature' stems from XML 1.0
    http://www.w3.org/TR/2008/REC-xml-20081126/#charencoding -->


 2)    REPLACE:
"By using <meta charset="UTF-8"/> (the HTML encoding declaration)."
    WITH:
"By using <meta charset="UTF-8"/> (the HTML encoding declaration) and
 thus, for XML parsers, rely on XML's encoding default (see above)."

<!--NOTE: 'XML's encoding default' is explained in the spec, one para
    above - and was also in my original proposal, see bug 12062. -->

The above changes state the facts about each method in a minimum amount of
text.
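
To illustrate what "encoding signature" means in practice for change 1), here
is a minimal Python sketch. The function names (has_utf8_bom, decode_polyglot)
and the sample bytes are assumptions made up for this illustration; they are
not taken from any spec or parser.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def decode_polyglot(data: bytes) -> str:
    """Decode as UTF-8, dropping a single leading BOM if one is present."""
    # 'utf-8-sig' strips one leading BOM and otherwise behaves like 'utf-8'.
    return data.decode("utf-8-sig")


# Hypothetical sample documents, identical except for the signature.
with_bom = codecs.BOM_UTF8 + '<meta charset="UTF-8"/>'.encode("utf-8")
without_bom = '<meta charset="UTF-8"/>'.encode("utf-8")
```

With these samples, both byte strings decode to the same text; only the first
carries the signature.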


Now, some replies to the I18N Group, to Henri and to Addison:


Reply to Comment #9 - the I18N Group: 

It will be great to see the PHP test - and I don't mind putting it in the spec
somewhere as long as we can also mention the problems of the <meta
charset="UTF-8"/> method. For my own part, I use a PHP based CMS where I had no
problems adding the BOM.



Reply to Henri - Comment #10:

> ...it doesn't follow that Polyglot Markup should 
> then promote things you like within the subset.

The polyglot facts/subset/principles say that
   a) the XML and HTML DOMs should be identical, 
   b) the syntax should be legal and necessary in HTML and XML
It follows that a feature (the BOM) that has the same effect in both XML and
HTML is a stricter subset of XML and HTML than a feature that has effect only
in HTML (and which needs the HTML5 spec's "permission" to appear in the XML
serialization).



Reply to Addison - Comment #11:

> I don't think the argument is that BOM should be removed altogether. 

Nonetheless, bug 12062, which is the basis for what the spec currently says, was
titled "UTF-8 BOM should not be forbidden in Polyglot Markup". This is because,
as of the 14th of February this year, the BOM was for some reason forbidden (I
wonder if the I18N Group had a finger in that).

> What the I18N WG is asking for is that it not (perhaps erroneously) 
> be considered the "preferred" option. 

Perhaps it was an error on your part to suggest that it might have been an error? ;-)
I don't see that it creates problems for anyone - not even for those tools
which do not support the BOM.  But if you can live with my compromise proposal,
then I don't need to defend "(preferred)". 

However, I would like to point out that the current spec text explicitly does
*not* state that one should only use one of the - several - encoding
declaration options. Instead the spec says:

    "... in the following ways, which may be used
       separately or in combination: ..."

From my POV, "(preferred)" is a recommendation to use the BOM - nothing more or
less. Thus it would not, as the spec stands, have been a spec violation to
not use the BOM or to combine it with the visible declaration or HTTP.

(From my POV, I think I would always - at least for HTML - include both an
external encoding declaration in HTTP as well as at least one internal -
currently that seems necessary in order to be on the safe side. But the
Polyglot Spec currently does not deal with such detailed advice. Should it?)

> Leaving aside whether this or that browser or tool responds well to BOM,

We cannot completely leave that aside when you yourself bring in (half-true)
claims about negative effects.

> the
> BOM is invisible when properly handled and a problem when visible.

Did you mean '... and visible if not properly handled'? For the record:
Opera has a bug in which it swallows the BOM even if the page is ISO-8859-1
encoded. Thus, it is also a problem when the BOM is not made visible in cases
where it should be visible.

> Visible
> encoding declarations (when correct) make page encoding easier to 
> work with for humans.

From my POV, there are enough 'visible' notifications of the encoding: browsers
report the encoding in one of their menus, and editors report the encoding in a
toolbar or elsewhere. And they all also tend to read the BOM as an encoding
declaration.

Still, I have not protested against the fact that the Polyglot spec points to
the i18n group's recommendation to use visible encoding declarations.  It is,
as I understand it, not endorsed by the spec - the spec merely cites
the i18n group's claim that it is helpful, and lets the author decide whether
this consideration is something he or she wants to take into account. 

This is OK with me, despite the fact that I think it is an advantage to,
when possible, only declare the encoding once. Because when something has to be
declared more than once, there is always a risk that the multiple
declarations get out of sync.
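
The out-of-sync risk can be made concrete with a small Python sketch. It is
only a rough illustration: the names (declared_encodings, in_sync), the
simplified regular expressions and the sample inputs are all assumptions made
for this example, and the code looks only at the HTTP charset parameter, the
UTF-8 BOM and a <meta charset> within the first 1024 bytes. It is not the
encoding sniffing algorithm of any spec or browser.

```python
import codecs
import re
from typing import Dict, Optional

# Simplified patterns, for illustration only.
META_CHARSET = re.compile(
    rb'<meta\s+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)
HTTP_CHARSET = re.compile(r'charset\s*=\s*"?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def declared_encodings(body: bytes, content_type: Optional[str] = None) -> Dict[str, str]:
    """Collect the encoding named by each declaration that is present."""
    found: Dict[str, str] = {}
    if content_type:
        m = HTTP_CHARSET.search(content_type)
        if m:
            found["http"] = m.group(1).lower()
    if body.startswith(codecs.BOM_UTF8):
        found["bom"] = "utf-8"
    m = META_CHARSET.search(body[:1024])
    if m:
        found["meta"] = m.group(1).decode("ascii").lower()
    return found


def in_sync(body: bytes, content_type: Optional[str] = None) -> bool:
    """True if all declarations that are present name the same encoding."""
    return len(set(declared_encodings(body, content_type).values())) <= 1


# Hypothetical sample: BOM and meta agree, but the HTTP header may not.
sample = codecs.BOM_UTF8 + b'<meta charset="UTF-8"/>'
```

For the sample above, in_sync(sample, "text/html; charset=utf-8") holds, while
a header of "text/html; charset=ISO-8859-1" makes the declarations disagree.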

> Specifying which one has priority and how to interpret each is the job of
> Polyglot, but the "preferred" is unnecessary and may actually depend on the
> user's tools and environment.

The BOM is not the only feature that relies on the tools and the environment.
All methods - the HTTP charset, the BOM and the meta charset element - depend
on those factors.

E.g., I have more than once used editors which did not understand the new
HTML5 <meta charset=charset> declaration element. 

Examples: 

    * The HTML parser inside libxml2 (try xmllint on the command line) does not
understand <meta charset="UTF-8"/> but instead defaults to ISO-8859-1.
*However*, if the document includes the BOM *or* the legacy <meta@content-type>
encoding declaration, libxml2's HTML parser still succeeds in detecting the
encoding as UTF-8.

    * The iCab web browser, before it switched to using Webkit, supported the
BOM, but did not support <meta charset="UTF-8"/>.

    * I'm sure I could find several more browsers and tools. 

Thus, the BOM sometimes has better support than the new meta charset element.
This is because the BOM is both XML-compatible and HTML-compatible *and*
because it is older than the HTML5 encoding declaration. 
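
The kind of behaviour described in the examples above can be modelled with a
toy Python sniffer. To be clear, this is NOT the actual algorithm of libxml2,
iCab or any other tool; the name legacy_sniff, the simplified regular
expression and the ISO-8859-1 fallback are assumptions chosen only to mirror
the described outcome: the BOM and the legacy declaration are recognised, the
new <meta charset> is not.

```python
import codecs
import re

# Legacy form: <meta http-equiv="Content-Type" content="text/html; charset=...">
# Simplified pattern, for illustration only.
LEGACY_META = re.compile(
    rb'<meta\s+http-equiv\s*=\s*["\']?content-type["\']?\s+'
    rb'content\s*=\s*["\'][^"\'>]*charset\s*=\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def legacy_sniff(body: bytes, fallback: str = "iso-8859-1") -> str:
    """Toy model of a pre-HTML5 tool: knows the BOM and the legacy meta only."""
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    m = LEGACY_META.search(body)
    if m:
        return m.group(1).decode("ascii").lower()
    # <meta charset="..."> is deliberately not recognised here.
    return fallback


# Hypothetical sample inputs.
new_meta = b'<meta charset="UTF-8"/>'
old_meta = b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>'
```

In this model, new_meta alone falls back to ISO-8859-1, whereas either
prefixing the BOM or using old_meta yields UTF-8.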

    Question: Why isn't the lack of back-compatibility for the new <meta
charset="UTF-8"/> a concern for you? Note that the Polyglot spec currently says
that polyglot documents do not use the legacy encoding declaration (despite
the fact that it is fully tolerated, without any warning, to use it in
non-polyglot HTML). Perhaps even Polyglot Markup should tolerate the legacy
<meta@http-equiv> variant?

    PS: I don't want to hide facts: I so far know of 3 parsers which make
the BOM visible, all of them text browsers: Lynx inserts an empty paragraph at
the top of the page. Elinks inserts an empty *line* at the top of the page.
Links inserts a paragraph with an <unknown> character inside. Also, the very
outdated IE5.x for Mac behaves similarly to Links, but without going into
quirks mode. (For the record: the text browsers netrik and w3m do not have
this problem.)

-- 
Configure bugmail: http://www.w3.org/Bugs/Public/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You reported the bug.
Received on Monday, 1 August 2011 14:03:23 UTC