- From: Philippe Le Hegaret <plh@w3.org>
- Date: Wed, 04 Feb 2004 16:48:14 -0500
- To: Kasimier Buchcik <kbuchcik@4commerce.de>
- Cc: WWW DOM <www-dom@w3.org>
- Message-Id: <1075931293.1868.126.camel@jfouffa.w3.org>
On Fri, 2004-01-30 at 08:29, Kasimier Buchcik wrote: > Hi, > > I'm trying to implement the method LSSerializer.writeToString and would > like to know what encoding declaration should be written if serializing > the document node. Should it always be "UTF-16", or should it be the > current xmlEncoding (e.g. ISO-8859-1)? What about if the DOMString is > UTF-16LE encoded, as in our implementation: should it always be "UTF-16" > or "UTF-16LE" in this case? > > Any hints? I did not find any explicit information. The specs say: "this > method completely ignores all the encoding information available", but > I'm not sure about what this really means. It has been clarified that writeToString is using the same encoding as the DOMString type itself, i.e. UTF-16. You raise an interesting point regarding the value of the XML declaration encoding itself and I don't think we considered it in the past. XML requires processors to understand "UTF-8" and "UTF-16" (see section 4.3.3 of XML 1.0 [1]). In other words, the values "UTF-16BE" and "UTF-16LE" are not required to be supported. Our reason to have them in LSSerializer was to give the ability to choose between LE and BE, but not necessarily to have them in the XML declaration. As indicated in XML, entities encoded in UTF-16 MUST begin with the Byte Order Mark, so I see no reason why the value of the XML declaration encoding should contain "UTF-16BE" or "UTF-16LE", especially since this introduces some interoperability troubles. Proposal: The LSSerializer MUST support the value "UTF-8", "UTF-16", "UTF-16LE", and "UTF-16BE". If the value is "UTF-16", the choice between big endian or little endian is platform dependent. If the UTF-16 encoding is in use, the value of the XML declaration encoding (if serialized) MUST be "UTF-16" and, as required by XML, the serialized content MUST begin with the Byte Order Mark. If the UTF-8 encoding is in use, the value of the XML declaration encoding (if serialized) MUST be "UTF-8", and the serialized content SHOULD NOT begin with the Byte Order Mark. * XML uses a MAY instead of SHOULD NOT regarding UTF-8. The recommendation for the LSSerializer is not to generate the BOM unless the implementation has some good reasons to do so. Philippe [1] http://www.w3.org/TR/2004/REC-xml-20040204/#charencoding
Received on Wednesday, 4 February 2004 16:48:35 UTC