At 22:55 03/10/08 -0700, Johnny Stenback wrote:
>Francois Yergeau wrote:
>[...]
>>While this is sufficient for strict interoperability, it is not for
>>compatibility of code. If there is not at least one required encoding, it
>>is not possible to write a DOM program that will work over any DOM
>>implementation. We insist that at least UTF-8 be required. Furthermore,
>>since XML 1.0 did it back in 1998, it cannot be so onerous to require all 3.
>>Please reconsider.
>
>Agreed, the spec now requires that those 3 encodings must be supported
>when dealing with XML data.

This is very valuable progress. However, I wonder how the DOM is able
to make the distinction between little-endian and big-endian versions
of UTF-16. Please note that simply using UTF-16LE and UTF-16BE does
not solve this issue, because neither UTF-16LE (as defined in RFC 2781)
nor UTF-16BE (as also defined in RFC 2781) takes a BOM, but UTF-16 for
XML requires a BOM.

There are the following four variants for UTF-16?? and XML:

UTF-16, big-endian:
- requires BOM
- parsers required to parse
- if labeled (Content-Type header or 'encoding' pseudo-attribute),
  label is "UTF-16"

UTF-16, little-endian:
- requires BOM
- parsers required to parse
- if labeled (Content-Type header or 'encoding' pseudo-attribute),
  label is "UTF-16"

UTF-16BE:
- BOM prohibited
- parsers may or may not parse (like any legacy encoding)
- Needs to be labeled:
  Content-Type: type/subtype;charset=UTF-16BE or
  <?xml version='1.0' encoding='UTF-16BE'?>

UTF-16LE:
- BOM prohibited
- parsers may or may not parse (like any legacy encoding)
- Needs to be labeled:
  Content-Type: type/subtype;charset=UTF-16LE or
  <?xml version='1.0' encoding='UTF-16LE'?>

My recommendation is that the DOM requires that DOM implementations
are able to write out 'UTF-8' and 'UTF-16', and that the choice of
endianness for UTF-16 is left to the implementation (because XML
Processors can deal with both endiannesses, and there is always a BOM).
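The BOM rules for the four variants can be illustrated with a minimal Python sketch. This is not DOM Load and Save code; the helper names `encode_labeled` and `detect_utf16_endianness` and the `big_endian` parameter are hypothetical, introduced only to show which byte sequences each label would produce under the rules listed above.

```python
import codecs


def encode_labeled(text, label, big_endian=True):
    """Hypothetical helper: encode text according to an encoding label.

    'UTF-16' always gets a BOM, and the endianness is the writer's choice.
    'UTF-16BE' and 'UTF-16LE' (RFC 2781) never get a BOM.
    """
    label = label.upper()
    if label == "UTF-8":
        return text.encode("utf-8")
    if label == "UTF-16":
        # The BOM is what lets a reader tell the two byte orders apart.
        if big_endian:
            return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    if label == "UTF-16BE":
        return text.encode("utf-16-be")  # no BOM
    if label == "UTF-16LE":
        return text.encode("utf-16-le")  # no BOM
    raise ValueError("unsupported label in this sketch: " + label)


def detect_utf16_endianness(data):
    """Return 'big' or 'little' if data starts with a UTF-16 BOM, else None."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    return None


if __name__ == "__main__":
    for label, be in [("UTF-16", True), ("UTF-16", False),
                      ("UTF-16BE", True), ("UTF-16LE", True)]:
        data = encode_labeled("<a/>", label, big_endian=be)
        print(label, data, detect_utf16_endianness(data))
```

In this sketch, output labeled 'UTF-16' is self-describing in either byte order, while 'UTF-16BE' and 'UTF-16LE' output carries no BOM and so depends entirely on an external or in-document label.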

The alternatives would be:

- To introduce an additional parameter to be able to specify the
  endianness if the encoding chosen is UTF-16. Probably too much
  work for what it gets you.

- To redefine 'UTF-16BE' and 'UTF-16LE' to mean 'UTF-16 big endian'
  and 'UTF-16 little endian' only for the specific case of DOM save.
  This would create a rather confusing special case, would make it
  impossible to actually write out something like 'UTF-16BE' (as
  defined in RFC 2781), would require somebody who wants to write
  out UTF-16 to change from 'UTF-16' to 'UTF-16BE' (most people
  would probably forget to do so), and would still require defining
  what happens on output if the encoding is 'UTF-16' (anything
  from 'always big endian' to 'implementation defined' to 'forbidden').

Regards, Martin.

Received on Thursday, 9 October 2003 15:11:43 GMT