
[Bug 12062] UTF-8 BOM should not be forbidden in Polyglot Markup

From: <bugzilla@jessica.w3.org>
Date: Sat, 12 Mar 2011 21:26:06 +0000
To: public-html-bugzilla@w3.org
Message-Id: <E1PyWK6-0005dd-5P@jessica.w3.org>
http://www.w3.org/Bugs/Public/show_bug.cgi?id=12062

Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |

--- Comment #13 from Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no> 2011-03-12 21:26:04 UTC ---
(In reply to comment #12)

Some things seem like repetition. And the text doesn't fully reflect the fact
that UTF-8 is now the only encoding, which changes a few things. And my many
suggestions have made it a bit long. So, once more:

FIRST, I think you should say, as the very first thing, that UTF-8 is the
encoding of polyglot markup. So before the very first paragraph ("Polyglot
markup declares character encoding in the following ways"), I suggest saying
this (stealing thoughts from bug 12242):

]] 
  Polyglot markup uses the UTF-8 encoding, the only encoding that both HTML and
XML parsers are REQUIRED to support. For HTML, the UTF-8 encoding MUST be
declared so that user agents do not default to the locale encoding. For XML,
UTF-8 is the encoding default and as such MAY be left undeclared.
[[

> ]]
> Polyglot markup declares character encoding in the following ways, which may be
> used separately or in combination (if used in combination, each approach
> contains identical encoding information): 
> •Within the document
>     ◦By using the Byte Order Mark (BOM) character (preferred).
>     ◦By relying on UTF-8 as the encoding default of XML, used in combination
>                   with the HTML <meta charset="UTF-8"/> element.

I suggest referring to that element as "the HTML encoding declaration". Hence,
a reformulation of the last item:

]]
 ◦By relying on UTF-8 as the encoding default of XML, used in 
   combination with the HTML encoding declaration: <meta charset="UTF-8"/>
[[
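
To illustrate the two in-document approaches, here is a minimal Python sketch that reports which of them a byte stream uses. The function name, the regular expression, and the 1024-byte search window are my own assumptions for illustration; a real HTML parser's encoding prescan is more lenient than this pattern.

```python
import codecs
import re

# Crude pattern for the polyglot form <meta charset="..."/> (an assumption:
# real HTML prescanning accepts more variations than this).
META_CHARSET = re.compile(
    rb'<meta\s+charset\s*=\s*["\']([^"\']+)["\']\s*/>', re.IGNORECASE
)

def in_document_declarations(data: bytes) -> dict:
    """Report whether a UTF-8 BOM and a UTF-8 meta declaration are present."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    # Only look near the start of the document (window size is an assumption).
    match = META_CHARSET.search(data[:1024])
    # The charset value is compared case-insensitively against "utf-8".
    meta_is_utf8 = bool(match) and match.group(1).lower() == b"utf-8"
    return {"bom": has_bom, "meta_utf8": meta_is_utf8}
```

When both are present they carry identical information, which is what the "used in combination" wording asks for.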

> •Within the document
>     ◦By using the Byte Order Mark (BOM) character (preferred).
>     ◦By relying on UTF-8 as the encoding default of XML, used in combination
> with the HTML <meta charset="UTF-8"/> element.

<questionmark>
> •In the HTTP header of the response [HTTP11], as in the following: 
>     Content-type: text/html; charset=utf-8
>  Note that polyglot markup may use either text/html or application/xhtml+xml
>  for the value of the content type. 
</questionmark>

In the introduction, you say: 
   ]] Other permissible MIME types are text/xml, application/xml, and any MIME
type whose subtype ends with the four characters "+xml". [[ 
Thus this note, which limits the MIME type to just two, does not reflect the
introduction. 

Also, in the name of "Show, don't tell", I suggest stating the HTTP
section like so (it is important to show an example for application/xhtml+xml
as well):

]]
•In the HTTP header of the response [HTTP11]: 
     ◦ For HTML: Content-type: text/html; charset=utf-8
     ◦ For XHTML: Content-type: application/xhtml+xml; charset=utf-8
     (And the same pattern for other permissible MIME types, see the
Introduction.)
[[
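
The "same pattern for other permissible MIME types" could be sketched like this in Python. The function names are hypothetical, and the list of permissible types is taken from the sentence in the Introduction quoted above (text/html, text/xml, application/xml, and any subtype ending in "+xml"); it is an illustration, not spec text.

```python
def is_permissible_mime(mime: str) -> bool:
    """Check a MIME type against the list given in the Introduction."""
    mime = mime.strip().lower()
    if mime in {"text/html", "text/xml", "application/xml"}:
        return True
    type_, sep, subtype = mime.partition("/")
    # Any subtype ending in "+xml", e.g. application/xhtml+xml.
    return bool(type_) and bool(sep) and len(subtype) > 4 and subtype.endswith("+xml")

def content_type_header(mime: str) -> str:
    """Build the HTTP header line; the charset is always utf-8."""
    if not is_permissible_mime(mime):
        raise ValueError(f"not a permissible MIME type for polyglot markup: {mime!r}")
    # HTTP header names are case-insensitive; "Content-Type" is the usual spelling.
    return f"Content-Type: {mime.strip().lower()}; charset=utf-8"
```

Only the MIME type varies between the HTML and the XHTML case; the charset parameter stays the same.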

> Using <meta charset="*"/> has no effect in XML. Therefore, polyglot markup may
> use <meta charset="*"/> provided the document is encoded as UTF-8 and the value
> of charset is a case-insensitive match for the string "utf-8". 

The phrase "provided the document is encoded as UTF-8" does not make sense now
that UTF-8 is the only encoding of polyglot markup. How about this rewrite:

    ]]
 NOTE: Unlike using the BOM character, the HTML encoding declaration (<meta
charset="UTF-8"/>) has no effect in XML. But because the UTF-8 encoding is the
encoding default of XML, it represents accurate information and can be used. 

   [ And perhaps the note about the i18n Group's recommendation to always
include a visible declaration should be moved up here? ]

    [[
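
A small Python sketch of the point in that note, using the standard library's expat-based XML parser as a stand-in for "an XML parser" (an assumption; other parsers are not tested here). The sample document and the helper name are made up for illustration. The document carries no XML declaration, so the parser falls back to UTF-8, and the meta element is just an ordinary element to it.

```python
import codecs
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample document, encoded as UTF-8 with no XML declaration.
DOC = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><meta charset="UTF-8"/><title>Bl\u00e5b\u00e6r</title></head>'
    '<body/></html>'
).encode("utf-8")

def title_of(data: bytes) -> str:
    """Parse the bytes as XML and return the text of the title element."""
    root = ET.fromstring(data)
    return root.find(f"{{{XHTML_NS}}}head/{{{XHTML_NS}}}title").text

# Without a BOM the parser relies on the XML default (UTF-8);
# with a BOM it reads the BOM. Either way the non-ASCII title survives.
without_bom = title_of(DOC)
with_bom = title_of(codecs.BOM_UTF8 + DOC)
```

In both cases the meta element plays no part in how the XML parser decodes the bytes.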

<delete>
> Polyglot markup uses UTF-8 encoding. 
</delete>

This would be a repetition, given that I suggested saying this as the very
first thing - see above.

<delete>
> The BOM character may be used with the UTF-8 encoding 
>(see Writing HTML documents in [HTML5]), and using the BOM
> character is preferred to not using the BOM character.
</delete>

As UTF-8 is the only encoding, "may be used with the UTF-8 encoding" is not
necessary to say. Did you mean to say "can be used"? Maybe we can just delete
it? The BOM character is already listed at the beginning of this section.
Hence, it does not feel necessary to repeat here that it can be used.

<delete>
> Because the construct of the BOM character is the same for XML
> and HTML (unlike the encoding declaration inside
> the HTTP Content-Type header) and because the BOM character works in
> both XML and HTML (unlike the <meta charset="UTF-8"/> declaration of 
> HTML and the UTF-8 encoding default of XML), 
> the BOM character can be said to be the
> most polyglot encoding declaration. 
</delete>

By adding "Note: Unlike using the BOM character" above, I think the above lines
can be deleted as well. And the point (which I made) that the MIME type
differs can now, if you add application/xhtml+xml as I suggested above, be
spotted in the "In the HTTP header of the response [HTTP11]" list.

> The W3C Internationalization (i18n) Group recommends to always include a
> visible encoding declaration in a document, because it helps developers,
> testers, or translation production managers to check the encoding of a document
> visually. 
> [[

Received on Saturday, 12 March 2011 21:26:15 GMT
