writeToString, write, and UTF-16[BE|LE] (was: "LSSerializer.writeToString - what encoding declaration?") from Philippe Le Hegaret on 2004-02-04 (www-dom@w3.org from January to March 2004)

From: Philippe Le Hegaret <plh@w3.org>
Date: Wed, 04 Feb 2004 16:48:14 -0500
To: Kasimier Buchcik <kbuchcik@4commerce.de>
Cc: WWW DOM <www-dom@w3.org>
Message-Id: <1075931293.1868.126.camel@jfouffa.w3.org>

On Fri, 2004-01-30 at 08:29, Kasimier Buchcik wrote:
> Hi,
> 
> I'm trying to implement the method LSSerializer.writeToString and would 
> like to know what encoding declaration should be written if serializing 
> the document node. Should it always be "UTF-16", or should it be the 
> current xmlEncoding (e.g. ISO-8859-1)? What about if the DOMString is 
> UTF-16LE encoded, as in our implementation: should it always be "UTF-16" 
> or "UTF-16LE" in this case?
> 
> Any hints? I did not find any explicit information. The specs say: "this 
> method completely ignores all the encoding information available", but 
> I'm not sure about what this really means.

It has been clarified that writeToString uses the same encoding as
the DOMString type itself, i.e. UTF-16.

You raise an interesting point regarding the value of the encoding in
the XML declaration itself, and I don't think we considered it in the
past. XML requires processors to understand "UTF-8" and "UTF-16" (see
section 4.3.3 of XML 1.0 [1]). In other words, processors are not
required to support the values "UTF-16BE" and "UTF-16LE". Our reason for
having them in LSSerializer was to let the caller choose between LE and
BE, not necessarily to have them appear in the XML declaration. As
indicated in XML, entities encoded in UTF-16 MUST begin with the Byte
Order Mark, so I see no reason why the encoding in the XML declaration
should be "UTF-16BE" or "UTF-16LE", especially since this introduces
interoperability problems.
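
To illustrate why the Byte Order Mark makes a "UTF-16LE" or "UTF-16BE"
label redundant in the declaration, here is a minimal Python sketch of
how a reader could recover the byte order from the first two bytes
alone. The function name utf16_byte_order is hypothetical and not part
of any DOM or XML API, and the sketch does not try to tell UTF-16 apart
from UTF-32.

```python
import codecs
from typing import Optional


def utf16_byte_order(data: bytes) -> Optional[str]:
    """Return "big" or "little" if data starts with a UTF-16 BOM, else None.

    Hypothetical helper for illustration only. It does not distinguish
    UTF-16 from UTF-32, whose little-endian BOM also starts with FF FE.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    return None
```

With the byte order known from the BOM, the declaration only needs to
say "UTF-16".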

Proposal:

The LSSerializer MUST support the values "UTF-8", "UTF-16", "UTF-16LE",
and "UTF-16BE". If the value is "UTF-16", the choice between big endian
or little endian is platform dependent. If the UTF-16 encoding is in
use, the value of the XML declaration encoding (if serialized) MUST be
"UTF-16" and, as required by XML, the serialized content MUST begin with
the Byte Order Mark. If the UTF-8 encoding is in use, the value of the
XML declaration encoding (if serialized) MUST be "UTF-8", and the
serialized content SHOULD NOT begin with the Byte Order Mark.

* XML uses MAY rather than SHOULD NOT for the Byte Order Mark in UTF-8.
The recommendation for the LSSerializer is not to generate the BOM
unless the implementation has a good reason to do so.
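
The proposal above could be sketched roughly as follows in Python. This
is only an illustration under some assumptions: the function name
serialize_document is hypothetical, "the UTF-16 encoding is in use" is
read as covering "UTF-16", "UTF-16LE" and "UTF-16BE" alike, and the
platform-dependent choice for plain "UTF-16" is made with sys.byteorder,
which is just one possible policy.

```python
import codecs
import sys


def serialize_document(body: str, requested: str) -> bytes:
    """Encode an XML declaration plus body according to the proposal.

    Hypothetical helper, not the LSSerializer API. `body` is the already
    serialized markup without an XML declaration.
    """
    name = requested.upper()
    if name == "UTF-8":
        # Declared as "UTF-8"; no BOM is written (SHOULD NOT).
        declared, codec, bom = "UTF-8", "utf-8", b""
    elif name in ("UTF-16", "UTF-16LE", "UTF-16BE"):
        if name == "UTF-16":
            # Assumption: follow the native byte order of the platform.
            little = sys.byteorder == "little"
        else:
            little = name == "UTF-16LE"
        codec = "utf-16-le" if little else "utf-16-be"
        bom = codecs.BOM_UTF16_LE if little else codecs.BOM_UTF16_BE
        # The declaration always says "UTF-16"; the BOM carries the order.
        declared = "UTF-16"
    else:
        raise ValueError(f"unsupported encoding in this sketch: {requested}")

    text = f'<?xml version="1.0" encoding="{declared}"?>{body}'
    return bom + text.encode(codec)
```

For example, serialize_document("<a/>", "UTF-16LE") yields bytes that
start with FF FE and declare encoding="UTF-16", never "UTF-16LE".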

Philippe

[1] http://www.w3.org/TR/2004/REC-xml-20040204/#charencoding

Received on Wednesday, 4 February 2004 16:48:35 UTC