- From: John Boyer <jboyer@PureEdge.com>
- Date: Wed, 29 Nov 2000 15:58:23 -0800
- To: "Paul Hoffman / IMC" <phoffman@imc.org>, <w3c-ietf-xmldsig@w3.org>
Hi Paul,

Yes, my prior email incorrectly used BMP to refer to the 17 encoding planes available to UTF-16. This occurred while skimming over UTF-7, which I assumed was just another way of doing UTF-8 but in fact seems capable of encoding only the BMP. So, in any prior email, my statements about the BMP should actually have been directed at "the set of 17 character planes available to UTF-16". In my mind, the BMP seems misnamed, since we apparently need a lot of stuff outside it to meet the basic language needs of earthlings. Perhaps UTF-16 should become the Basic Multilingual "3-space". :)

Anyway, Jeff's point that UCS-2 != Unicode has now hit home, thanks to some of the examples in the UTF-8 spec. These examples clearly show triplets of UCS-2 values being used to form a single character, which does not appear to be permissible under UTF-16. Since the Unicode manual is quite clear on the equivalence between Unicode and UTF-16 (p. 19), this would mean that UCS-2 != Unicode. So it would seem that we need to include UCS-2 in the list of things that should not have NFC applied.

This leads us to your suggestion, which is to say "REQUIRED to use Normalization Form C [NFC] when converting an XML document to the UCS character domain from a non-UCS encoding". Based on what I've read over the last two days, this does not work for me. The XML standard makes it clear that you can actually encode a document in native UCS-4; you do not have to use UTF-8 or UTF-16. So the statement you suggest would be interpreted as requiring NFC on UTF-8 and UTF-16 encodings, since the UCS encoding is something different.

Moreover, taking into account your suggestion to pretend UTF-7 never existed, combined with Tom's suggestion that we should explicitly say what Unicode means, the following is probably the best so far: "REQUIRED to use Normalization Form C [NFC] when converting an XML document to the UCS character domain from an encoding other than a UCS encoding, UTF-8, UTF-16, UTF-16BE, and UTF-16LE".

I said "so far" because I was interested in your statements about "local encodings" that are for private use in planes 15 and 16. Would you mind telling us a little more about that? I think we're OK from a signature-mechanics standpoint, but I just want to be sure that it is safe to push to the application the responsibility for capturing the context of private regions.

Thanks,

John Boyer
Team Leader, Software Development
Distributed Processing and XML
PureEdge Solutions Inc.
Creating Binding E-Commerce
v: 250-479-8334, ext. 143  f: 250-479-3772  1-888-517-2675
http://www.PureEdge.com

> While nothing currently exists out there,

This is also not true: there are private use areas allocated in planes 15 and 16.

> I think ISO/IEC 10646-2 is supposed to change that fact, so it
> would be helpful for us to change our sentence about the conditions
> under which we expect the application of Normalization Form C to
> occur.

This all started with a statement: "REQUIRED to use Normalization Form C [NFC] when converting an XML document to the UCS character domain from a non-Unicode encoding". This was a bit of shorthand on the part of whoever wrote it. Simply change "a non-Unicode encoding" to "any non-UCS encoding" or "any local encoding".

> In conclusion, it would be helpful to know whether anyone thinks
> UTF-7 (http://www.ietf.org/rfc/rfc2152.txt)
> should be included since it does claim to be a format for encoding
> Unicode characters.

Oh God no.
UTF-7 was a mistake and has, thankfully, never been widely adopted. The only real use of UTF-7 is in IMAP and everyone there deeply regrets it. Pretend that you never heard of UTF-7.

--Paul Hoffman, Director
--Internet Mail Consortium
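[Editor's note: a minimal sketch, not part of the original thread, illustrating two points discussed above: that UTF-16 reaches all 17 planes via surrogate pairs while UCS-2 stops at the BMP, and that NFC is applied when converting text from a non-UCS encoding. The choice of Python, the plane-15 private-use character U+F0000, and the ISO 8859-1 sample text are illustrative assumptions, not anything mandated by the wording under discussion.]

    import unicodedata

    # (1) U+F0000 is the first code point in the plane-15 private use area.
    #     UTF-16 represents it with a surrogate pair (4 bytes); a strict
    #     UCS-2 codec has no surrogate mechanism and cannot represent it.
    pua_char = "\U000F0000"
    utf16 = pua_char.encode("utf-16-be")
    print(utf16.hex())                     # 'db80dc00' -> high/low surrogate pair

    # (2) Bytes in a non-UCS encoding (here ISO 8859-1) decoded into the UCS
    #     character domain and then normalized to NFC.
    latin1_bytes = b"resum\xe9"            # "resume" with e-acute in ISO 8859-1
    text = latin1_bytes.decode("iso-8859-1")
    nfc_text = unicodedata.normalize("NFC", text)
    print(nfc_text == text)                # True: already in composed form

    # Decomposed input (e + combining acute) composes under NFC:
    decomposed = "e\u0301"
    print(unicodedata.normalize("NFC", decomposed) == "\u00e9")   # True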
Received on Wednesday, 29 November 2000 18:58:59 UTC