- From: Martin J. Duerst <duerst@w3.org>
- Date: Thu, 13 Apr 2000 11:47:05 +0900
- To: "Sujatha N. Marsden" <smarsden@etranslate.com>, w3c-i18n-ig@w3.org
- Cc: xml-editor@w3.org, w3c-xml-core-wg@w3.org
At 00/04/12 14:55 -0700, Sujatha N. Marsden wrote:

> >For the record, and this will come as no surprise, I totally oppose this
> >change, because I do *not* think 16LE and 16BE are appropriate for use with
> >XML, as they fly in the face of XML's orientation towards interoperability
> >across heterogeneous systems. I think XML entities encoded in any flavor
> >of UTF-16 should always have a BOM; exactly what the current spec [correctly
> >IMHO] says.
>
>Should it be considered an error if it doesn't contain a BOM? IMHO, in the
>absence of a BOM, UTF-16BE should be assumed.

XML makes special promises in the case of UTF-16:

1) All XML processors have to be able to deal with it
2) If there is no charset information, UTF-8 or UTF-16 have to be assumed.
   To distinguish them, the BOM is used.

It would be possible to say that XML tagged as UTF-16 but not starting with
a BOM is legal. This might be desirable in the sense that it would reduce
the interdependency between 'charset' definitions and the XML Rec.

However, the XML spec clearly says that if it's UTF-16, it has to have a
BOM, and whichever way you interpret 'UTF-16', changing it to not require a
BOM when it is tagged as UTF-16 would be a clear change to the spec that I
think is undesirable, as opposed to the clarifications that we are working
on now.

>"Text labelled "UTF-16LE" can always be interpreted as being little-
> endian. The detection of an initial BOM does not affect de-
> serialization of text labelled as UTF-16LE. Finding 0xFE followed by
> 0xFF is an error since there is no Unicode character 0xFFFE, which
> would be the interpretation of those octets under little-endian
> order."
>
>Well, FEFF is not being interpreted as a character but as a mark which is
>very different. But interestingly enough, FEFF is allowed in case UTF-16
>is the charset declaration.

Finding U+FFFE in any kind of UTF-16 flavor is an error, because it's not
a character.
In LE, this code point would be encoded as 0xFE 0xFF, so the above text is
absolutely correct. You may find 0xFF 0xFE in LE text, even at the start,
but in that case it's not a BOM, it's just a ZWNBSP (U+FEFF, ZERO WIDTH
NO-BREAK SPACE). The BOM was forbidden at the start of UTF-16BE/LE, among
other reasons, to make this case unambiguous.

Regards,   Martin.
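
For readers who want to check the octet sequences discussed above, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names `sniff_unlabelled` and `decode_utf16le` are hypothetical, the sniffing covers only the UTF-8 versus UTF-16 case mentioned in the message (not the full autodetection an XML processor may perform), and the error handling is one possible reading of the quoted draft text, not a normative implementation.

```python
# Illustrative sketch only; the function names below are hypothetical.

BOM = "\ufeff"      # U+FEFF: BOM, or ZERO WIDTH NO-BREAK SPACE in content
NONCHAR = "\ufffe"  # U+FFFE: not a character in any UTF-16 flavor


def sniff_unlabelled(data: bytes) -> str:
    """Guess the encoding of an entity that has no charset information.

    Assumption: only UTF-8 and UTF-16 are candidates, and the BOM is
    what distinguishes them.
    """
    if data[:2] == b"\xfe\xff":
        return "UTF-16 (big-endian)"
    if data[:2] == b"\xff\xfe":
        return "UTF-16 (little-endian)"
    return "UTF-8"


def decode_utf16le(data: bytes) -> str:
    """Decode text labelled UTF-16LE; the label alone fixes the byte order."""
    text = data.decode("utf-16-le")
    if NONCHAR in text:
        # Octets 0xFE 0xFF read as little-endian give U+FFFE.
        raise ValueError("U+FFFE is not a character")
    # A leading U+FEFF (octets 0xFF 0xFE) is kept as ordinary content.
    return text


print(BOM.encode("utf-16-le").hex())      # prints "fffe"
print(NONCHAR.encode("utf-16-le").hex())  # prints "feff"
```

Python's own `utf-16` codec consumes a leading BOM to pick the byte order, while `utf-16-le` leaves a leading U+FEFF in the decoded text, which matches the distinction drawn above between the two labels.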
Received on Wednesday, 12 April 2000 22:51:34 UTC