[Bug 12062] New: UTF-8 BOM should not be forbidden in Polyglot Markup

http://www.w3.org/Bugs/Public/show_bug.cgi?id=12062

           Summary: UTF-8 BOM should not be forbidden in Polyglot Markup
           Product: HTML WG
           Version: unspecified
          Platform: PC
               URL: http://dev.w3.org/html5/html-xhtml-author-guide/html-xhtml-authoring-guide.html#character-encoding
        OS/Version: All
            Status: NEW
          Severity: major
          Priority: P2
         Component: HTML/XHTML Compatibility Authoring Guide (ed: Eliot
                    Graff)
        AssignedTo: eliotgra@microsoft.com
        ReportedBy: xn--mlform-iua@xn--mlform-iua.no
         QAContact: public-html-bugzilla@w3.org
                CC: davidc@nag.co.uk, mike@w3.org,
                    public-html-wg-issue-tracking@w3.org,
                    public-html@w3.org, shadow2531@gmail.com,
                    xn--mlform-iua@xn--mlform-iua.no,
                    eliotgra@microsoft.com


The spec says:

]]  When polyglot markup uses UTF-8, it does not include a BOM.  [[

I recommend deleting the above statement, because I can see no basis for it.
In my view, the UTF-8 BOM can *help* when working with polyglot markup.
Justification:

 a) the UTF-8 BOM is understood by both XML and HTML5 parsers;
 b) the UTF-8 BOM lets you omit <meta> @http-equiv="content-type"
      or @charset (which, in any case, only HTML parsers understand).

For offline parsing via file:// URLs, the presence of a UTF-8 BOM seems to me
to be an advantage. For online parsing it has the further advantage of
providing encoding information even if HTTP fails to provide it.
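
To illustrate the point about in-band encoding information, here is a minimal
Python sketch of how a consumer could recognise the UTF-8 BOM (the bytes
EF BB BF) at the start of a file or response body. The function names and the
sample document are hypothetical and chosen only for this example; this is not
how any particular XML or HTML5 parser is implemented.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string begins with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def decode_polyglot(data: bytes) -> str:
    """Decode bytes as UTF-8, dropping a leading BOM if one is present."""
    # The 'utf-8-sig' codec strips an optional leading BOM when decoding;
    # input without a BOM is decoded as ordinary UTF-8.
    return data.decode("utf-8-sig")


# Hypothetical sample: a tiny polyglot-style document preceded by a BOM.
sample = codecs.BOM_UTF8 + '<html xmlns="http://www.w3.org/1999/xhtml"></html>'.encode("utf-8")

if starts_with_utf8_bom(sample):
    print("BOM found: treating the document as UTF-8")
print(decode_polyglot(sample))
```

No HTTP header or <meta> element is consulted here; the encoding is read from
the first three bytes alone, which is what makes the BOM usable for file://
URLs.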

The fact that some (very) legacy user agents may act up if they see the UTF-8
BOM has not prevented HTML5 from permitting it. Thus, if the UTF-8 BOM should
be declared as something that is not used in Polyglot Markup, then please
provide a justification/principle for such a decision. 

Furthermore, the following statement from the same section seems to
contradict the statement that the UTF-8 BOM should not be used:

]]
Polyglot markup declares character encoding one of two ways:
   By using the BOM.
   In the HTTP header of the response [HTTP11], as in the following:
[[

If you accept my argument that the UTF-8 BOM can be used, then I suggest
replacing the above quote with the following, more accurate reformulation:

]]
Polyglot markup declares character encoding in the following ways, which may
be used separately or in combination, as long as they contain the same
encoding information:
   Inside the document:
     * by the use of a BOM;
     * by relying on the XML UTF-8 encoding default in combination with
       <meta charset="UTF-8"/>
   In the HTTP header [ etc - keep the current text ]
[[
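
The condition in the proposed wording, that the declarations carry the same
encoding information, could be checked roughly as in the Python sketch below.
The function name and the regular expressions are assumptions made for this
example; they are deliberately simplistic and do not reproduce the encoding
sniffing rules of HTML5 or XML.

```python
import codecs
import re
from typing import Optional


def declarations_agree(body: bytes, content_type: Optional[str] = None) -> bool:
    """Return True if every encoding declaration that is present names the same encoding."""
    found = set()

    # 1. A UTF-8 BOM at the very start of the document.
    if body.startswith(codecs.BOM_UTF8):
        found.add("utf-8")

    # 2. A <meta charset="..."> element (naive pattern, illustration only).
    meta = re.search(rb'<meta\s+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', body, re.IGNORECASE)
    if meta:
        found.add(meta.group(1).decode("ascii").lower())

    # 3. The charset parameter of an HTTP Content-Type header, if one was sent.
    if content_type:
        header = re.search(r'charset\s*=\s*"?([A-Za-z0-9_\-]+)', content_type, re.IGNORECASE)
        if header:
            found.add(header.group(1).lower())

    # Zero or one distinct label means nothing contradicts anything else.
    return len(found) <= 1


doc = codecs.BOM_UTF8 + b'<html><head><meta charset="UTF-8"/></head></html>'
print(declarations_agree(doc, "text/html; charset=utf-8"))
print(declarations_agree(doc, "text/html; charset=ISO-8859-1"))
```

A document with no declaration at all also passes this check, which matches
the case where the XML UTF-8 default is relied on.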


Received on Monday, 14 February 2011 13:31:33 UTC