Martin J. Duerst
November 30, 2000
At 00/11/29 10:12 -0800, John Boyer wrote:
 >Actually, it appears that the info on p. 20 of Unicode Standard 3.0 was
 >slightly misleading.  They talk of using UTF-8 as an encoding format.
 >However, while I think of UTF-8 as encoding all of UCS-4, they appear to be
 >only using UTF-8 to encode the portion of UCS-4 that Unicode represents,
 >which is the 16 x 64k character regions that compose the BMP.
 >So, the prior sentence was still sufficient.  The following would appear to
 >do the trick:
 >"use Normalization Form C [NFC] when converting an XML document to the UCS
 >character domain from an encoding other than UCS-4, UTF-8, UTF-16,
 >or UTF-16LE."

I think you are close with the above, but I would change it to:

"use Normalization Form C [NFC] when converting an XML document to the UCS
character domain from any encoding that is not UCS-based (currently, UCS-based
encodings include UTF-8, UTF-16, UTF-16BE, UTF-16LE, UCS-2, and UCS-4)."

Why my change:

- There are also other UCS-based encodings in the IANA registry
    (look e.g. for 'unicode' or 'iso10646').
- There are encodings we know qualify but don't want to mention (UTF-7).
- We don't know what others might come up (hopefully none :-).
- UCS-2 is mentioned because it's not the same as UTF-16.
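
To make the intent of the wording concrete, here is a minimal sketch in
Python of what I have in mind; it is an illustration only, not text for the
spec. The function name to_ucs and the set UCS_BASED are hypothetical, and
the set uses Python codec names as an assumption: Python has no codec
labelled UCS-2 or UCS-4, so utf-32 stands in for UCS-4 and UCS-2 is left out.

```python
import codecs
import unicodedata

# Assumption: Python codec names standing in for the UCS-based encodings
# named in the proposed wording. Python has no codec labelled UCS-2 or
# UCS-4; utf-32 is used here as a stand-in for UCS-4.
UCS_BASED = {
    "utf-8", "utf-16", "utf-16-be", "utf-16-le",
    "utf-32", "utf-32-be", "utf-32-le",
}


def to_ucs(data: bytes, encoding: str) -> str:
    """Decode data; apply NFC only if the source encoding is not UCS-based."""
    name = codecs.lookup(encoding).name  # canonical codec name, e.g. "utf-8"
    text = data.decode(name)
    if name in UCS_BASED:
        return text  # already in the UCS domain: leave it untouched
    return unicodedata.normalize("NFC", text)


# windows-1258 (Vietnamese) carries combining marks, so decoding it can
# yield text that is not in NFC.
legacy = "e\u0301".encode("cp1258")
from_legacy = to_ucs(legacy, "cp1258")
from_utf8 = to_ucs("e\u0301".encode("utf-8"), "UTF-8")
```

Here the windows-1258 input comes out as the single precomposed character
U+00E9, while the UTF-8 input keeps its two code points, since a UCS-based
source is not normalized.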

Please feel free to send this to the involved lists for further
discussion.

Regards,   Martin.

