RE: Character Encoding Question

Hi Paul,

Yes, my prior email incorrectly used BMP to refer to the 17 encoding planes
available to UTF-16.  This occurred while skimming over UTF-7, which I
assumed was just another way of doing UTF-8 but in fact seems only capable
of encoding the BMP.
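
To make the distinction concrete, here is a minimal Python sketch (not taken
from any of the specs; the helper names plane_of and utf16_units are purely
illustrative) showing that the BMP is only plane 0, while UTF-16 reaches the
other 16 planes by spending two 16-bit code units on a character:

```python
def plane_of(ch: str) -> int:
    """Return the plane number (0 through 16) of a single character."""
    return ord(ch) >> 16


def utf16_units(ch: str) -> int:
    """Return how many 16-bit code units UTF-16 needs for this character."""
    return len(ch.encode("utf-16-be")) // 2


bmp_char = "\u00e9"           # plane 0, i.e. inside the BMP
supp_char = "\U0001D11E"      # plane 1, outside the BMP
last_code_point = "\U0010FFFF"  # final code point of plane 16

for ch in (bmp_char, supp_char, last_code_point):
    print(f"U+{ord(ch):06X} plane={plane_of(ch)} units={utf16_units(ch)}")
```

Anything in plane 0 fits in one code unit; anything in planes 1 through 16
takes a surrogate pair.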

So, in any prior email, my statements about the BMP should actually have
been directed to "the set of 17 character planes available to UTF-16".  In
my mind, it would seem that the BMP is misnamed since we seem to need a lot
of stuff outside of it to meet the basic language needs of earthlings.
Perhaps UTF-16 should become the Basic Multilingual "3-space".  :)

Anyway, Jeff's point about UCS-2 != Unicode has now hit home thanks to some
of the examples in the UTF-8 spec.  These examples clearly show triplets of
UCS-2 values being used to form a single character, which does not appear to
be permissible under UTF-16.  Since the Unicode manual is quite clear on the
equivalence between Unicode and UTF-16 (p. 19), this would mean that UCS-2
!= Unicode.

So it would seem that we need to include UCS-2 in the list of things that
should not have NFC applied.

This leads us to your suggestion, which is to say "REQUIRED to use
Normalization Form C [NFC] when converting an XML document to the UCS
character domain from a non-UCS encoding".

Based on what I've read over the last two days, this does not work for me.
The XML standard makes it clear that you can actually encode a document in
native UCS-4.  You do not have to use UTF-8 or UTF-16.  So, the statement
you suggest would be interpreted as requiring NFC on UTF-8 and UTF-16
encodings, since the UCS encoding is something different.

Moreover, taking into account your suggestion to pretend UTF-7 never existed
combined with Tom's suggestion that we should explicitly say what Unicode
means, the following is probably the best so far:

"REQUIRED to use Normalization Form C [NFC] when converting an XML document
to the UCS character domain from an encoding other than a UCS encoding,
UTF-8, UTF-16, UTF-16BE, and UTF-16LE".
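
As a rough illustration of how an implementation might read that sentence,
here is a minimal Python sketch.  It is only a sketch under assumptions: the
function name to_ucs and the UCS_ENCODINGS set are hypothetical, Python's
utf-32 codecs stand in for native UCS-4, and the set would need adjusting to
whatever list we finally agree on.

```python
import codecs
import unicodedata

# Assumed exemption list: encodings treated as already being in the UCS
# character domain, so NFC is NOT applied when reading them.
UCS_ENCODINGS = {
    "utf-8",
    "utf-16", "utf-16-be", "utf-16-le",
    "utf-32", "utf-32-be", "utf-32-le",  # stand-in for native UCS-4
}


def to_ucs(data: bytes, encoding: str) -> str:
    """Decode data; apply NFC only when transcoding from a local encoding."""
    canonical = codecs.lookup(encoding).name  # e.g. "UTF-8" becomes "utf-8"
    text = data.decode(canonical)
    if canonical in UCS_ENCODINGS:
        # Already a UCS encoding: leave the character sequence untouched.
        return text
    # Local (non-UCS) encoding: normalize to Normalization Form C.
    return unicodedata.normalize("NFC", text)
```

For example, a document in a local encoding that carries "e" followed by a
combining acute accent would come out as the single precomposed character,
while the same sequence arriving in UTF-8 would pass through unchanged.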

I said 'so far' because I was interested in your statements about 'local
encodings' that are for private use in planes 15 and 16.  Would you mind
telling us a little more about that?  I think we're OK from a signature
mechanics standpoint, but I just want to be sure that it is safe to push to
the application the responsibility for capturing the context of private
regions.

Thanks,
John Boyer
Team Leader, Software Development
Distributed Processing and XML
PureEdge Solutions Inc.
Creating Binding E-Commerce
v: 250-479-8334, ext. 143  f: 250-479-3772
1-888-517-2675   http://www.PureEdge.com



>   While nothing currently exists out there,

This is also not true: there are private use areas allocated in
planes 15 and 16.

>  I think ISO/IEC 10646-2 is supposed to change that fact, so it
>would be helpful for us to change our sentence about the conditions
>under which we expect the application of Normalization Form C to
>occur.

This all started with a statement:

"REQUIRED to use Normalization Form C [NFC] when converting an XML
document to the UCS character domain from a non-Unicode encoding".

This was a bit of shorthand on the part of whoever wrote it. Simply
change "a non-Unicode encoding" to "any non-UCS encoding" or "any
local encoding".


>In conclusion, it would be helpful to know whether anyone thinks
>UTF-7
>(http://www.ietf.org/rfc/rfc2152.txt)
>should be included since it does claim to be a format for encoding
>Unicode characters.

Oh God no. UTF-7 was a mistake and has, thankfully, never been widely
adopted. The only real use of UTF-7 is in IMAP and everyone there
deeply regrets it. Pretend that you never heard of UTF-7.

--Paul Hoffman, Director
--Internet Mail Consortium

Received on Wednesday, 29 November 2000 18:58:59 UTC