- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Fri, 30 Jul 2010 13:14:47 +0900
- To: John Cowan <cowan@ccil.org>
- CC: Richard Ishida <ishida@w3.org>, francois@yergeau.com, public-html@w3.org, www-international@w3.org
Hello John, François (long time no see!), and others,

On 2010/07/27 6:14, John Cowan wrote:
> On 15 July 2010 22:43, François Yergeau wrote:
>
>> It depends on what you mean by "UTF-16 encoded documents". In the XML
>> spec, a "document in the UTF-16 encoding" means (somewhat strangely, I
>> would agree) that the document is actually in UTF-16 (OK so far) and
>> that the encoding has been identified as "UTF-16". Not "UTF-16BE" or
>> "UTF-16LE", these are different beasts, even though the actual encoding
>> is of course the same.

Yes. For us humans, UTF-16, UTF-16BE, and UTF-16LE look extremely close (because they are). However, on a spec level and on an implementation level, these are just different encodings with different 'charset' labels.

Different encodings and different labels mean that UTF-16BE and UTF-16LE are treated just the same way as 'foo-unknown-bar', a label that I just made up now (and won't bother to define an actual encoding for). If some XML parser sees

  <?xml version='1.0' encoding='foo-unknown-bar' ?>

it will just say "unknown encoding" or some such, or it might (for values other than 'foo-unknown-bar' that are actually defined and registered) use that character encoding for decoding and parsing.

The same is supposed to happen if you have something like

  <?xml version='1.0' encoding='Extended_UNIX_Code_Fixed_Width_for_Japanese' ?>

(with null bytes before each of the bytes/characters that you can see above; check out Extended_UNIX_Code_Fixed_Width_for_Japanese at http://www.iana.org/assignments/character-sets): either the parser knows this encoding, or it doesn't.

The same also has to apply if you happen to have

  <?xml version='1.0' encoding='UTF-16BE' ?>

or

  <?xml version='1.0' encoding='UTF-16LE' ?>

(again with null bytes sprinkled in, which are omitted here for your mail software's convenience). Why? Because everybody who implemented an XML parser before UTF-16BE and UTF-16LE were defined was just checking for "UTF-16", and there is no way for a parser to suddenly say "oh, 'UTF-16BE' seems to start with 'UTF-16', so maybe these are related, so...".

The rule "'UTF-16' in XML MUST have a BOM" (which is not true of 'UTF-16' in general) makes sense and can be justified because 'UTF-16' has the privilege of being grokked by all parsers, even without an explicit label. However, creating special rules for UTF-16BE or UTF-16LE would not have made sense, because we don't really know for sure what other encodings similar to UTF-16 might turn up (I very much hope none!).

> The reason for that is XML-specific. An XML document entity cannot
> begin with a ZWNBSP, so if it begins with the bytes 0xFF 0xFE, it must
> be a UTF-16 entity body with a BOM. But if the entity is an external
> parsed entity (document fragment) or external parameter entity (DTD
> fragment), then it may begin with any XML-legal Unicode character,
> including ZWNBSP, which would also be 0xFF 0xFE in UTF-16LE encoding.
> The result is a nasty ambiguity: does the document's character content
> start with ZWNBSP or not? (And analogously for 0xFE 0xFF and
> UTF-16BE.)
>
> The Core WG decided to resolve the ambiguity in favor of the UTF-16
> encoding. An external entity that appears to begin with a BOM does
> begin with a BOM. If you need to create an external entity beginning
> with a ZWNBSP, you use UTF-8 or UTF-16, or else you use an explicit
> encoding declaration.

Or use a numeric character reference &-).

>> So XML parsers are not strictly required to grok UTF-16 documents
>> labelled as UTF-16BE/LE.
>
> Correct. The only encodings a parser is absolutely required to
> support are UTF-8 (with or without BOM) and UTF-16 (with BOM).

Yes, correct. In a strict view, there are no "UTF-16 documents labeled as UTF-16BE (or ..LE)". A document labeled as UTF-16BE is a document encoded in UTF-16BE. A document labeled as UTF-16LE is a document encoded in UTF-16LE. That these (with the exception that those labeled 'UTF-16' have a leading BOM) are essentially byte-for-byte identical is, on a spec and implementation level, just an accident.
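These byte-level claims can be checked mechanically; here is a short Python 3 sketch (the language choice is incidental, and xml.etree.ElementTree happens to be backed by expat, one parser that behaves as described):

  # Sketch (Python 3): check some of the byte-level claims above.
  import codecs
  import xml.etree.ElementTree as ET

  # 1. An unknown 'charset' label is simply rejected. ElementTree is
  #    backed by expat, which reports "unknown encoding" here.
  try:
      ET.fromstring(b"<?xml version='1.0' encoding='foo-unknown-bar'?><doc/>")
  except ET.ParseError as e:
      print(e)  # e.g. "unknown encoding: line 1, column 30"

  # 2. 'UTF-16' content is just 'UTF-16BE' or 'UTF-16LE' content with
  #    the matching BOM prepended; otherwise the bytes are identical.
  text = "<doc/>"
  be = text.encode("utf-16-be")   # no BOM
  le = text.encode("utf-16-le")   # no BOM
  bom16 = text.encode("utf-16")   # native byte order, BOM prepended
  assert bom16 in (codecs.BOM_UTF16_BE + be, codecs.BOM_UTF16_LE + le)

  # 3. The ambiguity John describes: ZWNBSP (U+FEFF) in UTF-16LE is
  #    byte-for-byte the same as the little-endian BOM, 0xFF 0xFE.
  assert "\ufeff".encode("utf-16-le") == codecs.BOM_UTF16_LE == b"\xff\xfe"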
Compare that to documents labeled as UTF-8 (or unlabeled), labeled as US-ASCII, or labeled as ISO-8859-1/2/..., that contain only characters in the ASCII range. These are all documents labeled differently, and therefore in different encodings, although they are byte-for-byte identical (see the sketch below).
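The same quick check in Python ('us-ascii' and 'iso-8859-1' are standard codec names there):

  # One byte sequence, three labels: in the ASCII range, UTF-8,
  # US-ASCII, and ISO-8859-1 encode text byte-for-byte identically.
  data = "<doc>hello</doc>".encode("us-ascii")
  assert data == "<doc>hello</doc>".encode("utf-8")
  assert data == "<doc>hello</doc>".encode("iso-8859-1")
  # Only the declared label distinguishes them, exactly as with
  # UTF-16BE/LE versus the BOM-less content of UTF-16.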
I remember that I once spent an hour or so on the phone convincing Tim Bray that this was the only sound way (from a spec/implementation point of view, if not necessarily from a human, higher-level-understanding point of view) to look at and deal with these things. That must have been something like 10 years ago.

Regards, Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

Received on Friday, 30 July 2010 04:15:29 UTC