"Re: write and UTF-16[BE|LE]" from Kasimier Buchcik on 2004-02-05 (www-dom@w3.org from January to March 2004)

From: Kasimier Buchcik <kbuchcik@4commerce.de>
Date: Thu, 05 Feb 2004 14:29:52 +0100
To: <www-dom@w3.org>
Message-ID: <40224550.7070002@4commerce.de>

Hi,

on 2/4/2004 11:39 PM Philippe Le Hegaret wrote:
> On Wed, 2004-02-04 at 17:26, jcowan@reutershealth.com wrote:
> 
>>Philippe Le Hegaret scripsit:
>>
>>>As indicated in
>>>XML, entities encoded in UTF-16 MUST begin with the Byte Order Mark, so
>>>I see no reason why the value of the XML declaration encoding should
>>>contain "UTF-16BE" or "UTF-16LE", especially since this introduces some
>>>interoperability troubles.
>>
>>That means that entities encoded in the encoding named "UTF-16" must begin
>>with a BOM.  Entities in the encodings "UTF-16BE" and "UTF-16LE" must not
>>begin with a BOM, but must have an appropriate encoding declaration.

Yes, I think so.

> Looking again at XML 1.0 3rd, it says that UTF-16 encoded entities MUST
> begin with a BOM. Unless I'm misinterpreting the meaning of "UTF-16
> encoded entities", I would say that it does include UTF16-BE and
> UTF16-LE as well.

RFC 2781 does say:

"Text in the "UTF-16LE" charset MUST be serialized with the octets
  which make up a single 16-bit UTF-16 value in little-endian order.
  Systems labelling UTF-16LE text MUST NOT prepend a BOM to the text."
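
To make the distinction concrete, here is a minimal sketch in Python (not
our Delphi code; the sample text is made up). It relies only on Python's
built-in codecs, where "utf-16" writes a BOM and "utf-16-le" /
"utf-16-be" do not, which matches the RFC 2781 labelling rules quoted
above:

```python
import codecs

text = "<a/>"  # hypothetical sample content

# "utf-16": Python's encoder prepends a BOM in the platform's native byte order.
with_bom = text.encode("utf-16")

# "utf-16-le" / "utf-16-be": the byte order is fixed by the label, no BOM is written.
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")

print(with_bom.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)))
print(le.startswith(codecs.BOM_UTF16_LE))
print(be.startswith(codecs.BOM_UTF16_BE))
```

The first check reports a BOM, the other two do not; the BOM-less forms
are only unambiguous because the label itself fixes the byte order.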

But the dilemma I see is that our Delphi implementation needs the 
DOMString not to have a BOM. This is fine according to the Load & 
Save Candidate Recommendation:

(http://www.w3.org/TR/2003/CR-DOM-Level-3-LS-20031107/load-save.html)

"When outputting unicode data, whether or not a byte order mark is 
serialized, or if the output is big-endian or little-endian, is 
implementation dependent."

Maybe it is just my understanding of the DOMString so far, which seemed 
to be closely tied to the implementing application. If I get a 
Node.nodeValue (in our Delphi implementation), I expect the DOMString to 
be encoded in UTF-16, little-endian, with no BOM. If I serialize with 
LSSerializer.writeToString, I would get UTF-16 with a BOM, as the XML 
spec states. This would cause problems with DOMString operations, since 
I cannot tell whether a given DOMString came from the LSSerializer or 
not. I assumed that the DOMString was designed to have a consistent 
structure within an application, and that LSOutput.characterStream and 
LSOutput.byteStream would play the role of a *pure* XML entity.
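
A rough illustration of the problem, sketched in Python and not in our
Delphi code: the helper strip_bom and the sample values are hypothetical,
and the BOM-prefixed string only stands in for what a BOM-emitting
writeToString might hand back under the reading discussed above.

```python
BOM = "\ufeff"  # U+FEFF, the character a UTF-16 byte order mark decodes to


def strip_bom(s: str) -> str:
    """Hypothetical helper: drop one leading U+FEFF, if present."""
    return s[1:] if s.startswith(BOM) else s


node_value = "abc"        # a string as obtained from Node.nodeValue (no BOM)
serialized = BOM + "abc"  # assumed result of a BOM-emitting serializer

# Length and equality differ depending on where the string came from ...
print(len(node_value), len(serialized))
print(node_value == serialized)

# ... unless every consumer remembers to normalize first.
print(node_value == strip_bom(serialized))
```

Every string operation would need this kind of normalization if a
DOMString could carry a BOM depending on its origin.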

So is the DOMString really intended to represent an XML entity, or should 
it be handled more like an interface to the implementing programming 
language?

Thanks and regards,

Kasimier Buchcik

Received on Thursday, 5 February 2004 08:26:07 UTC