- From: John Cowan <cowan@ccil.org>
- Date: Mon, 26 Jul 2010 17:14:58 -0400
- To: Richard Ishida <ishida@w3.org>, francois@yergeau.com
- Cc: public-html@w3.org, www-international@w3.org
On Mon, Jul 26, 2010 at 2:52 PM, Richard Ishida <ishida@w3.org> wrote:

> I have summarised in simplified and graphic form my understanding of
> the algorithm in html5 for detecting character encodings. See
> http://www.w3.org/International/2010/07/html5-encoding-detection.png

Part of the problem, I think, is the confusion between "UTF-16LE" as the
formal name of an encoding, which does not permit a BOM and is always
little-endian, and the informal use of "utf16le" in your diagram to mean
a document which in fact uses little-endian encoding. (And analogously
for "UTF-16BE".)

> Please see the explanation from François Yergeau below about use of
> BOM and UTF-16, UTF-16BE and UTF-16LE (forwarded with permission). As
> I understand it, you should use a BOM if you have identified or
> labelled the content as 'UTF-16', i.e. with no indication of the
> endianness. The Unicode Standard also says that if you have labelled
> or identified your text as 'UTF-16BE' or 'UTF-16LE', you should not
> use a BOM (since it should be interpreted as a word joiner at the
> start of the text).

Correct.

> This rules out the use of the UTF-16BE and UTF-16LE character
> encodings, since they should not start with a BOM.

That's true, but those encodings aren't really very useful: you save
only two bytes. The main use for the UTF-16BE and UTF-16LE encodings is
when you can't afford a BOM because you have zillions of short strings
to deal with, as in a database.

> A little later, the spec says:
>
> "If an HTML document contains a meta element with a charset attribute
> or a meta element with an http-equiv attribute in the Encoding
> declaration state, then the character encoding used must be an
> ASCII-compatible character encoding."
>
> This rules out the use of a character encoding declaration with the
> value UTF-16, even in content that is encoded in that encoding.

That *is* a Bad Thing, and should be fixed. The simplest operational
approach is just to make sure that 0x00 bytes are always ignored when
parsing encoding declarations. See my posting "Hello! I am an XML
encoding sniffer!" at
http://recycledknowledge.blogspot.com/2005/07/hello-i-am-xml-encoding-sniffer.html;
a rough sketch of the idea appears at the end of this message.

On 15 July 2010 22:43, François Yergeau wrote:

> It depends on what you mean by "UTF-16 encoded documents". In the XML
> spec, a "document in the UTF-16 encoding" means (somewhat strangely, I
> would agree) that the document is actually in UTF-16 (OK so far) and
> that the encoding has been identified as "UTF-16". Not "UTF-16BE" or
> "UTF-16LE"; these are different beasts, even though the actual
> encoding is of course the same.

The reason for that is XML-specific. An XML document entity cannot begin
with a ZWNBSP, so if it begins with the bytes 0xFF 0xFE, it must be a
UTF-16 entity body with a BOM. But if the entity is an external parsed
entity (a document fragment) or an external parameter entity (a DTD
fragment), then it may begin with any XML-legal Unicode character,
including ZWNBSP, which would also be 0xFF 0xFE in the UTF-16LE
encoding. The result is a nasty ambiguity: does the entity's character
content start with a ZWNBSP or not? (And analogously for 0xFE 0xFF and
UTF-16BE.)

The Core WG decided to resolve the ambiguity in favor of the UTF-16
encoding: an external entity that appears to begin with a BOM does begin
with a BOM. If you need to create an external entity beginning with a
ZWNBSP, you use UTF-8 or UTF-16, or else you use an explicit encoding
declaration.

> So XML parsers are not strictly required to grok UTF-16 documents
> labelled as UTF-16BE/LE.
Correct. The only encodings a parser is absolutely required to support are UTF-8 (with or without BOM) and UTF-16 (with BOM).
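Here is a minimal sketch of that 0x00-ignoring approach, in Python. It
is only an illustration, not the HTML5 algorithm and not the code behind
the blog post above: the sniff function, its regular expressions, and
the UTF-8 fallback are all assumptions made for the example.

    import re

    # BOMs are checked first; a BOM always wins.  (UTF-32 is ignored
    # here to keep the sketch short.)
    BOMS = [
        (b'\xef\xbb\xbf', 'utf-8'),
        (b'\xff\xfe', 'utf-16'),  # little-endian content, BOM present
        (b'\xfe\xff', 'utf-16'),  # big-endian content, BOM present
    ]

    # Matches either an HTML charset=... attribute or an XML
    # encoding="..." pseudo-attribute in the raw byte stream.
    DECL = re.compile(
        rb'''charset\s*=\s*["']?([A-Za-z0-9._-]+)'''
        rb'''|encoding\s*=\s*["']([A-Za-z0-9._-]+)["']''')

    def sniff(prefix: bytes) -> str:
        for bom, name in BOMS:
            if prefix.startswith(bom):
                return name
        # The key trick: drop every 0x00 byte before scanning, so an
        # ASCII-ish declaration is still found even when the document
        # is BOM-less UTF-16 (each ASCII character carries a NUL).
        m = DECL.search(prefix.replace(b'\x00', b''))
        if m:
            return (m.group(1) or m.group(2)).decode('ascii').lower()
        return 'utf-8'  # fallback chosen arbitrarily for this sketch

For instance, sniff('<meta charset=utf-16>'.encode('utf-16-le'))
returns 'utf-16' even though the document has no BOM and every other
byte is 0x00. The ambiguity discussed above is also easy to reproduce:
on a little-endian machine, '\ufeffx'.encode('utf-16-le') and
'x'.encode('utf-16') produce identical bytes, so no sniffer can tell a
leading ZWNBSP from a BOM.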
Received on Monday, 26 July 2010 21:15:53 UTC