"Re: writeToString, write and, UTF-16[BE|LE]" from Kasimier Buchcik on 2004-02-09 (www-dom@w3.org from January to March 2004)

From: Kasimier Buchcik <kbuchcik@4commerce.de>
Date: Mon, 09 Feb 2004 11:45:10 +0100
To: Philippe Le Hegaret <plh@w3.org>, <www-dom@w3.org>
Message-ID: <402764B6.1050807@4commerce.de>

Hi,

on 2/4/2004 10:48 PM Philippe Le Hegaret wrote:
> On Fri, 2004-01-30 at 08:29, Kasimier Buchcik wrote:
> 
>>Hi,
>>
>>I'm trying to implement the method LSSerializer.writeToString and would 
>>like to know what encoding declaration should be written if serializing 
>>the document node. Should it always be "UTF-16", or should it be the 
>>current xmlEncoding (e.g. ISO-8859-1)? What about if the DOMString is 
>>UTF-16LE encoded, as in our implementation: should it always be "UTF-16" 
>>or "UTF-16LE" in this case?
>>
>>Any hints? I did not find any explicit information. The specs say: "this 
>>method completely ignores all the encoding information available", but 
>>I'm not sure about what this really means.
> 
> 
> It has been clarified that writeToString is using the same encoding as
> the DOMString type itself, i.e. UTF-16.
> 
> You raise an interesting point regarding the value of the XML
> declaration encoding itself and I don't think we considered it in the
> past. XML requires processors to understand "UTF-8" and "UTF-16" (see
> section 4.3.3 of XML 1.0 [1]). In other words, the values "UTF-16BE" and
> "UTF-16LE" are not required to be supported. Our reason to have them in
> LSSerializer was to give the ability to choose between LE and BE, but
> not necessarily to have them in the XML declaration. As indicated in
> XML, entities encoded in UTF-16 MUST begin with the Byte Order Mark, so
> I see no reason why the value of the XML declaration encoding should
> contain "UTF-16BE" or "UTF-16LE", especially since this introduces some
> interoperability troubles.
> 
> Proposal:
> 
> The LSSerializer MUST support the value "UTF-8", "UTF-16", "UTF-16LE",
> and "UTF-16BE". If the value is "UTF-16", the choice between big endian
> or little endian is platform dependent. If the UTF-16 encoding is in
> use, the value of the XML declaration encoding (if serialized) MUST be
> "UTF-16" and, as required by XML, the serialized content MUST begin with
> the Byte Order Mark. If the UTF-8 encoding is in use, the value of the
> XML declaration encoding (if serialized) MUST be "UTF-8", and the
> serialized content SHOULD NOT begin with the Byte Order Mark.
> 
> * XML uses a MAY instead of SHOULD NOT regarding UTF-8. The
> recommendation for the LSSerializer is not to generate the BOM unless
> the implementation has some good reasons to do so.
> 
> Philippe
> 
> [1] http://www.w3.org/TR/2004/REC-xml-20040204/#charencoding
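
For reference, the proposal quoted above can be sketched roughly as follows. This is only an illustration in Python, not anything taken from the specification: the function name serialize and its parameters are hypothetical, and it assumes that selecting "UTF-16LE" or "UTF-16BE" still yields the declaration value "UTF-16" plus a BOM, which is how I read the proposal.

```python
import codecs
import sys


def serialize(body, encoding="UTF-16"):
    """Hypothetical helper: prepend an XML declaration and encode to bytes."""
    name = encoding.upper()
    if name == "UTF-16":
        # Plain "UTF-16": the byte order is platform dependent.
        name = "UTF-16LE" if sys.byteorder == "little" else "UTF-16BE"

    if name == "UTF-8":
        # UTF-8: declare "UTF-8" and do not emit a BOM.
        declared, bom, codec = "UTF-8", b"", "utf-8"
    elif name == "UTF-16LE":
        # Any UTF-16 flavor: declare "UTF-16" and start with the BOM.
        declared, bom, codec = "UTF-16", codecs.BOM_UTF16_LE, "utf-16-le"
    elif name == "UTF-16BE":
        declared, bom, codec = "UTF-16", codecs.BOM_UTF16_BE, "utf-16-be"
    else:
        raise ValueError("unsupported encoding: " + encoding)

    declaration = '<?xml version="1.0" encoding="%s"?>' % declared
    return bom + (declaration + body).encode(codec)
```

With this sketch, serialize("<a/>", "UTF-16LE") and serialize("<a/>", "UTF-16BE") differ only in byte order and BOM, while both carry encoding="UTF-16" in the declaration.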

Although I might get flamed for repeating a question (I posted it 
further down the thread), I still need to clarify the format of the 
DOMString when using LSSerializer.writeToString. As you wrote, I see 
that the declaration needs to be "UTF-16". But is it required to use a 
BOM?

---

(http://www.w3.org/TR/2003/CR-DOM-Level-3-LS-20031107/load-save.html)

"When outputting unicode data, whether or not a byte order mark is
serialized, or if the output is big-endian or little-endian, is
implementation dependent."
---

Unicode 4.0, 2.6 Encoding Schemes:

"When a higher-level protocol supplies mechanisms for handling the 
endianness of integral data types, it is not necessary to use Unicode 
encoding schemes or the byte order mark. In those cases Unicode text is 
simply a sequence of integral data types."

"Note that some of the Unicode encoding schemes have the same labels as 
the three Unicode encoding forms. This could cause confusion, so it is 
important to keep the context clear when using these terms: character 
encoding forms refer to integral data units in memory or in APIs, and 
byte order is irrelevant; character encoding schemes refer to 
byte-serialized data, as for streaming I/O or in file storage, and byte 
order must be specified or determinable."
---
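
The distinction between encoding forms and encoding schemes can be shown with a small sketch. It is written in Python purely for illustration (our implementation is in Delphi), and the helper name utf16_code_units is hypothetical. The list of integers stands in for the code units an API such as DOMString would expose; the byte strings are what a stream or file would contain.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as plain integers.

    This models the encoding *form*: a sequence of 16-bit values where
    byte order plays no role.
    """
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


text = "A\u20ac"  # LATIN CAPITAL LETTER A followed by EURO SIGN

# Encoding form: the same integers on every platform.
units = utf16_code_units(text)

# Encoding schemes: byte-serialized data, where byte order matters.
le_bytes = text.encode("utf-16-le")
be_bytes = text.encode("utf-16-be")
```

Here units is the same on every platform, whereas le_bytes and be_bytes differ, so a reader of the bytes needs the encoding label or a BOM to tell them apart.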

If I get a Node.nodeValue (in our Delphi implementation) I expect the 
DOMString to be encoded in UTF-16, little-endian, with no BOM. If I 
serialize with LSSerializer.writeToString I would get UTF-16 with a 
BOM, as the XML spec states. This would cause problems with DOMString 
operations, since I would first have to check whether any of the 
DOMStrings has a BOM. I assumed that the DOMString was designed to hold 
integral data and that LSOutput.characterStream and LSOutput.byteStream 
would be expected to fulfill the requirements of an encoding scheme.
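
A minimal sketch of the kind of trouble I mean, again in Python only for illustration. The names BOM and strip_bom are hypothetical, and the sketch assumes that writeToString would return the BOM as a leading U+FEFF code unit inside the string.

```python
BOM = "\ufeff"  # U+FEFF, which acts as the byte order mark at the start of UTF-16 data


def strip_bom(s):
    """Hypothetical check every string operation would need beforehand."""
    return s[1:] if s.startswith(BOM) else s


# Assumed result of writeToString if the BOM were part of the DOMString.
serialized = BOM + "<a/>"

# The length is off by one compared with the visible markup.
length_with_bom = len(serialized)

# Concatenation leaves a stray U+FEFF in the middle of the result.
joined = "<root>" + serialized + "</root>"
stray_position = joined.find(BOM)

# Only after stripping does the string behave like any other DOMString.
clean = strip_bom(serialized)
```

In this sketch length_with_bom is 5 rather than 4, and joined contains U+FEFF at index 6, between the start tag and the serialized fragment.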

So, once more: does the DOMString have to hold a BOM when serializing 
with LSSerializer.writeToString?


Regards,

Kasimier Buchcik
Received on Monday, 9 February 2004 05:41:19 UTC