- From: Martin Duerst <duerst@w3.org>
- Date: Thu, 09 Oct 2003 15:04:36 -0400
- To: Johnny Stenback <jst@w3c.jstenback.com>, Francois Yergeau <FYergeau@alis.com>
- Cc: "'www-dom@w3.org'" <www-dom@w3.org>, "'w3c-i18n-ig@w3.org'" <w3c-i18n-ig@w3.org>
At 22:55 03/10/08 -0700, Johnny Stenback wrote:

>Francois Yergeau wrote:
>[...]
>>While this is sufficient for strict interoperability, it is not for
>>compatibility of code. If there is not at least one required encoding, it
>>is not possible to write a DOM program that will work over any DOM
>>implementation. We insist that at least UTF-8 be required. Furthermore,
>>since XML 1.0 did it back in 1998, it cannot be so onerous to require all 3.
>>Please reconsider.
>
>Agreed, the spec now requires that those 3 encodings must be supported
>when dealing with XML data.

This is very valuable progress. However, I wonder how the DOM is able to make the distinction between the little-endian and big-endian versions of UTF-16. Please note that simply using UTF-16LE and UTF-16BE does not solve this issue, because neither UTF-16LE nor UTF-16BE (both as defined in RFC 2781) takes a BOM, whereas UTF-16 for XML requires a BOM.

There are the following four variants of UTF-16 in XML:

UTF-16, big-endian:
- requires BOM
- parsers required to parse
- if labeled (Content-Type header or 'encoding' pseudo-attribute), the label is "UTF-16"

UTF-16, little-endian:
- requires BOM
- parsers required to parse
- if labeled (Content-Type header or 'encoding' pseudo-attribute), the label is "UTF-16"

UTF-16BE:
- BOM prohibited
- parsers may or may not parse (like any legacy encoding)
- needs to be labeled, either as
  Content-Type: type/subtype;charset=UTF-16BE
  or
  <?xml version='1.0' encoding='UTF-16BE'?>

UTF-16LE:
- BOM prohibited
- parsers may or may not parse (like any legacy encoding)
- needs to be labeled, either as
  Content-Type: type/subtype;charset=UTF-16LE
  or
  <?xml version='1.0' encoding='UTF-16LE'?>

My recommendation is that the DOM require implementations to be able to write out 'UTF-8' and 'UTF-16', and that the choice of endianness for UTF-16 be left to the implementation (because XML processors can deal with both endiannesses, and there is always a BOM).

The alternatives would be:

- To introduce an additional parameter for specifying the endianness when the chosen encoding is UTF-16. This is probably too much work for what it gets you.

- To redefine 'UTF-16BE' and 'UTF-16LE' to mean 'UTF-16, big-endian' and 'UTF-16, little-endian' only for the specific case of DOM save. This would create a rather confusing special case, would make it impossible to actually write out UTF-16BE as defined in RFC 2781, would require somebody who wants to write out UTF-16 to change from 'UTF-16' to 'UTF-16BE' (most people would probably forget to do so), and would still require defining what happens on output if the encoding is 'UTF-16' (anything from 'always big-endian' to 'implementation-defined' to 'forbidden').

Regards,   Martin.
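To illustrate the recommendation above, here is a minimal Java sketch against the Load and Save interfaces (DOMImplementationLS, LSSerializer, LSOutput). It is only an assumption-laden example: the class name, the tiny test document, and the expectation that a given implementation honors setEncoding("UTF-16") by choosing a byte order and emitting a BOM are illustrative, since that last behavior is exactly what the spec text would have to pin down.

```java
import java.io.ByteArrayOutputStream;

import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSParser;
import org.w3c.dom.ls.LSSerializer;

public class Utf16SaveSketch {                       // hypothetical example class
    public static void main(String[] args) throws Exception {
        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        DOMImplementationLS ls =
            (DOMImplementationLS) registry.getDOMImplementation("LS");

        // Build a tiny document so there is something to serialize.
        LSParser parser =
            ls.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
        LSInput in = ls.createLSInput();
        in.setStringData("<doc>text</doc>");
        Document doc = parser.parse(in);

        // Ask for "UTF-16" and leave the byte order to the implementation.
        LSSerializer serializer = ls.createLSSerializer();
        LSOutput out = ls.createLSOutput();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        out.setByteStream(bytes);
        out.setEncoding("UTF-16");
        serializer.write(doc, out);

        // The first two bytes are the BOM, which is how a receiving XML
        // processor tells the endianness apart: FE FF = big-endian,
        // FF FE = little-endian (assuming the implementation writes one,
        // as "UTF-16" for XML requires).
        byte[] b = bytes.toByteArray();
        System.out.printf("BOM: %02X %02X%n", b[0] & 0xFF, b[1] & 0xFF);
    }
}
```

The point of the sketch is the design choice in the recommendation: the caller only ever says 'UTF-16', the serializer picks the endianness, and the BOM makes the output self-describing to any conforming XML processor, so no extra endianness parameter or redefined 'UTF-16BE'/'UTF-16LE' label is needed.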
Received on Thursday, 9 October 2003 15:11:43 UTC