- From: Richard Ishida <ishida@w3.org>
- Date: Mon, 26 Jul 2010 19:52:09 +0100
- To: <public-html@w3.org>, <www-international@w3.org>
[bringing in www-international] This is a follow-on from the thread at http://lists.w3.org/Archives/Public/public-html/2010Jul/0030.html with the subject renamed. You should read that thread if you haven't already.

I have summarised, in simplified and graphic form, my understanding of the algorithm in HTML5 for detecting character encodings. See http://www.w3.org/International/2010/07/html5-encoding-detection.png

This discussion is about what happens when there is no encoding information in the transport layer. Please see the explanation from François Yergeau below about use of the BOM and UTF-16, UTF-16BE and UTF-16LE (forwarded with permission). As I understand it, you should use a BOM if you have identified or labelled the content as 'UTF-16', ie. with no indication of the endianness. The Unicode Standard also says that if you have labelled or identified your text as 'UTF-16BE' or 'UTF-16LE', you should not use a BOM (since it would be interpreted as a word joiner at the start of the text).

HTML5 says: "If an HTML document does not start with a BOM, and if its encoding is not explicitly given by Content-Type metadata, and the document is not an iframe srcdoc document, then the character encoding used must be an ASCII-compatible character encoding..." http://dev.w3.org/html5/spec/semantics.html#charset

This rules out the use of the UTF-16BE and UTF-16LE character encodings, since documents so labelled should not start with a BOM.

A little later, the spec says: "If an HTML document contains a meta element with a charset attribute or a meta element with an http-equiv attribute in the Encoding declaration state, then the character encoding used must be an ASCII-compatible character encoding."

This rules out the use of a character encoding declaration with the value UTF-16, even in content that is encoded in that encoding.
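The BOM-sniffing step of the detection summarised in the diagram above could be sketched roughly as follows. This is a minimal illustration in Python of BOM recognition only, not the full HTML5 prescan; the function name is mine:

```python
import codecs

def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM,
    else (None, 0). Checks only the signatures relevant to this thread."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return 'UTF-8', 3
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return 'UTF-16BE', 2
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return 'UTF-16LE', 2
    return None, 0
```

When this step returns None, the HTML5 algorithm falls through to the transport layer, the meta prescan, and finally autodetection, which is exactly the path at issue here for UTF-16BE/LE content without a BOM.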
I earlier stated my preference to be able to say that a document is encoded in UTF-16 in the encoding declaration (in UTF-16 encoded documents, of course), because:

[1] some people will probably add meta elements when using UTF-16 encoded documents, and there's no harm in it that I can see, so no real reason to penalise them for it.

[2] i18n folks have long advised that you should always include a visible indication of the encoding in a document, HTML or XML, even if you don't strictly need to, because it can be very useful for developers, testers, or translation production managers who want to visually check the encoding of a document.

I suppose, by logical extension, people will expect that it is also possible to say that a document is encoded in UTF-16BE or UTF-16LE in the declaration. That could also lead to an expectation that the encoding declaration would actually be used to determine the encoding in such cases, since the file should not then start with a BOM. In fact, in that case, the encoding detection would currently be relegated to the browser's autodetection algorithms, and the spec doesn't currently specify that those should recognise UTF-16BE and UTF-16LE, as far as I'm aware.

The alternative may be to make it clearer that, although UTF-16 is ok, HTML5 and XHTML5 do not accept UTF-16BE and UTF-16LE encoding declarations - only UTF-16 with a BOM (which of course covers the same serialisations). One way or the other, this appears to constitute another difference between former XHTML/XML documents and the new polyglot docs, which should probably be documented.

What do people think?

RI

From: François Yergeau [mailto:francois@yergeau.com]
Sent: 15 July 2010 22:43
To: Richard Ishida
Cc: 'Henry S. Thompson'; msm@w3.org
Subject: Re: FW: i18n comments on Polyglot Markup

On 2010-07-15 13:06, Richard Ishida wrote:
> Can you give me any definitive answers on the questions of whether XML
> requires a BOM for UTF-16 encoded documents, and whether XML processors
> choke on the BOM?

It depends on what you mean by "UTF-16 encoded documents". In the XML spec, a "document in the UTF-16 encoding" means (somewhat strangely, I would agree) that the document is actually in UTF-16 (OK so far) and that the encoding has been identified as "UTF-16". Not "UTF-16BE" or "UTF-16LE"; these are different beasts, even though the actual encoding is of course the same. See the third sentence of the first para in 4.3.3 (http://www.w3.org/TR/REC-xml/#charencoding): "The terms "UTF-8" and "UTF-16" in this specification do not apply to related character encodings, including but not limited to UTF-16BE, UTF-16LE, or CESU-8."

So XML parsers are not strictly required to grok UTF-16 documents labelled as UTF-16BE/LE, and the BOM requirement (the next sentence in 4.3.3) does not apply to such documents. The "UTF-16BE" and "UTF-16LE" labels are defined in RFC 2781, which says (Sec. 3.3): "Systems labelling UTF-16BE text MUST NOT prepend a BOM to the text", and ditto for UTF-16LE. This of course applies to XML.

So it all depends on how you label your UTF-16 encoded documents. If you label them UTF-16BE/LE, no BOM is allowed (RFC 2781). If you label them UTF-16, or do not label them (ill-advised), then a BOM is required (SHOULD from RFC 2781, MUST from the XML spec).

As for parsers choking on the BOM, I have no actual experience, but I would consider it much more likely with UTF-8 (MAY in the XML spec) than with UTF-16. The BOM requirement for UTF-16 goes back to the first edition of XML, whereas the explicit allowance for UTF-8 came with the 3rd edition (2003).
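The labelling distinction described above is visible in Python's codecs, which happen to follow the same conventions: the 'utf-16' codec prepends a BOM (matching the "UTF-16" label, where a BOM is required), while 'utf-16-be' and 'utf-16-le' emit none (matching RFC 2781's MUST NOT for those labels). A small sketch:

```python
import codecs

# 'utf-16' prepends a BOM in the platform's native byte order,
# as the "UTF-16" label expects.
with_bom = 'x'.encode('utf-16')
assert with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

# 'utf-16-be' / 'utf-16-le' emit no BOM, as RFC 2781 requires
# for text labelled "UTF-16BE" / "UTF-16LE".
assert 'x'.encode('utf-16-be') == b'\x00x'   # big-endian, no BOM
assert 'x'.encode('utf-16-le') == b'x\x00'   # little-endian, no BOM
```

So a file serialised with the endian-specific codecs and then labelled plain "UTF-16" would violate the BOM requirement, which is the trap the thread is circling around.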
I would suspect that stories about this choking date back to when Microsoft started making things like Notepad write out a BOM when saving as UTF-8, which is what triggered the clarification in the XML 3rd edition. The UTF-8 BOM was never explicitly disallowed, but people generally thought of the BOM only as a byte order mark, not as the encoding signature that it really is. Hence some parsers were not prepared when it started appearing in UTF-8.

My 2¢.

-- François
Received on Monday, 26 July 2010 18:52:39 UTC