- From: Martin Duerst <duerst@w3.org>
- Date: Wed, 16 May 2001 00:35:13 +0900
- To: "Gregor Karlinger" <gregor.karlinger@iaik.at>, <merlin@baltimore.ie>, "Tom Gindin" <tgindin@us.ibm.com>
- Cc: <w3c-ietf-xmldsig@w3.org>
At 15:14 01/05/15 +0200, Gregor Karlinger wrote: [slightly reordered] >Doesn't this mean that you MUST use UTF-8 encoding for the XML >document, i. e. that it is impossible to use ASCII or ISO_8859* >since you insert a text node (the DN) which characters are Unicode? > >If so, should we really introduce such a restriction? An XML document consists of characters, not bytes. And in content, each character can be escaped by a numeric character reference. Therefore, you can have a document only encoded in us-ascii, but will lots of &#dddd; (or &#xhhhh;); this will transport any character in Unicode, and a browser will display it correctly (assuming it has the right font). > > There's another issue that seems relevant. RFC 2253 states > > that strings must be converted to UTF-8 and then the escaping > > rules must be applied. Do we honour this, or should we UTF-8 > > decode the RFC2253 string before embedding it in the text node. > > > > Essentially, should the final example in RFC 2253 be encoded > > in XML as: > > > > UTF-8 encode and require ASCII escaping of high-bit-set chars: > > SN=Lu\C4\8Di\C4\87 This would be okay. > > UTF-8 encode and embed the result directly: > > SN=Lu??i?? (where ? is a high-bit UTF-8 byte directly embedded) > > (Here the meaning is confusing because the UTF-8 encoded > > text will correspond to some other Unicode charactes, e.g. ト) This would be a catastrophe. You cannot put arbitrary byte sequences into and XML document. UTF-8 for example uses bytes in the range 0x80-0x9F, which are illegal is iso-8859-X. Such a file would have to be rejected by the XML parser. > > De-UTF-8 and embed the Unicode original: > > SN=Lu?i? (where ? is the original character) > > > > The last seems like the best option to me. I agree. But it has to be spelled out clearly, and tested. Regards, Martin.
Received on Tuesday, 15 May 2001 11:37:18 UTC