Re: i18n reviews of DOM 3 Core and Load&Save from Martin Duerst on 2003-10-09 (www-dom@w3.org from October to December 2003)

From: Martin Duerst <duerst@w3.org>
Date: Thu, 09 Oct 2003 15:04:36 -0400
To: Johnny Stenback <jst@w3c.jstenback.com>, Francois Yergeau <FYergeau@alis.com>
Cc: "'www-dom@w3.org'" <www-dom@w3.org>, "'w3c-i18n-ig@w3.org'" <w3c-i18n-ig@w3.org>
Message-Id: <4.2.0.58.J.20031009142821.0586e940@localhost>

At 22:55 03/10/08 -0700, Johnny Stenback wrote:

>Francois Yergeau wrote:
>[...]

>>While this is sufficient for strict interoperability, it is not for
>>compatibility of code.  If there is not at least one required encoding, it
>>is not possible to write a DOM program that will work over any DOM
>>implementation.  We insist that at least UTF-8 be required.  Furthermore,
>>since XML 1.0 did it back in 1998, it cannot be so onerous to require all 3.
>>Please reconsider.
>
>Agreed, the spec now requires that those 3 encodings must be supported 
>when dealing with XML data.

This is very valuable progress. However, I wonder how the DOM is able
to make the distinction between little-endian and big-endian versions
of UTF-16.

Please note that simply using UTF-16LE and UTF-16BE does not solve this
issue, because neither UTF-16LE (as defined in RFC 2781) nor UTF-16BE
(as also defined in RFC 2781) take a BOM, but UTF-16 for XML requires
a BOM.

There are the following four variants for UTF-16?? and XML:

UTF-16, big-endian:
     - requires BOM
     - parsers required to parse
     - if labeled (Content-Type header or 'encoding' pseudo-attribute),
       label is "UTF-16"

UTF-16, little-endian:
     - requires BOM
     - parsers required to parse
     - if labeled (Content-Type header or 'encoding' pseudo-attribute),
       label is "UTF-16"

UTF-16BE:
     - BOM prohibited
     - parsers may or may not parse (like any legacy encoding)
     - Needs to be labeled:
           Content-Type: type/subtype;charset=UTF-16BE   or
           <?xml version='1.0' encoding='UTF-16BE'?>

UTF-16LE:
     - BOM prohibited
     - parsers may or may not parse (like any legacy encoding)
     - Needs to be labeled:
           Content-Type: type/subtype;charset=UTF-16LE   or
           <?xml version='1.0' encoding='UTF-16LE'?>

My recommendation is that the DOM requires that DOM implementations
are able to write out 'UTF-8' and 'UTF-16', and that the choice of
endianness for UTF-16 is left to the implementation (because
XML Processors can deal with both endiannesses, and there is
always a BOM).

The alternatives would be:
- To introduce an additional parameter to be able to specify
   the endianness if the encoding choosen is UTF-16. Probably
   too much work for what it gets you.
- To redefine 'UTF-16BE' and 'UTF-16LE' to mean 'UTF-16 big
   endian' and 'UTF-16 little endian' only for the specific
   case of DOM save. This would create a rather confusing special
   case, would make it impossible to actually write out something
   like 'UTF-16BE' (as defined in RFC 2781), would require
   somebody who wants to write out UTF-16 to change from
   'UTF-16' to 'UTF-16BE' (most people would probably forget to
   do so), and would still require to define what happens on
   output if the encoding is 'UTF-16' (anything from 'always
   big endian' to 'implementation defined' to 'forbidden').


Regards,    Martin.

Received on Thursday, 9 October 2003 15:11:43 UTC