- From: John Cowan <cowan@mercury.ccil.org>
- Date: Wed, 27 Jun 2001 21:16:32 -0400 (EDT)
- To: Mike French <mfrench@atg.com>
- CC: www-xml-blueberry-comments@w3.org
Mike French scripsit:

> > Does Unicode 3.1 require surrogates ?

Yes, or more accurately it provides characters above U+FFFF.

> I know that surrogate encoding schemes have existed for a while (ab initio),
> and technically all Unicode processors should support them,
> but AFAIK there were no actual blocks assigned in the full UCS-4 domain until 3.x.

Until 3.1.

> If supporting 3.1 implicitly requires that UTF-16 and UTF-8 processing supports
> surrogates, because it has character blocks defined for the full UCS-4 domain,

Not the full UCS-4 domain, only up to U+10FFFF.  Code points beyond that
will never be assigned to characters.

> then this will break a lot of character handling implementations in the real world.
> For example, I bet quite a few UTF-8 converters only handle 1, 2 or 3-byte sequences
> (enough to hold 16-bit data), not the full 6(?) needed for surrogates.

UTF-8 converters should now handle 1, 2, 3, and 4-byte sequences.
There is no need for 5-byte or 6-byte sequences.  The warnings have
been going down since Unicode 2.0.

> And I also know that most Unicode implementations use unsigned short 16-bit
> integers to hold character data, not full 32-bit integers.

Then they can be UTF-16 implementations and be sure to obey UTF-16
rules (don't split surrogates).

> Anything that hastens the day when surrogates appear in XML,
> either explicitly or implicitly, is a very bad idea !

These characters are already allowed in XML 1.0 character content,
and have been since 1998.  Blueberry is about allowing them in names.

> P.S. Your Unicode link points to XPointer !
> Is this a circular meta-reference ???

It's a bug.

-- 
John Cowan                                   cowan@ccil.org
One art/there is/no less/no more/All things/to do/with sparks/galore
	--Douglas Hofstadter
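
To make the byte counts in the reply concrete, here is a minimal Python sketch of the two encodings discussed: UTF-8 needs at most 4 bytes for anything up to U+10FFFF, and UTF-16 represents code points above U+FFFF as a surrogate pair. The function names `utf8_encode` and `utf16_units` are made up for this illustration; they are not from the original message or from any library, and the code is a sketch rather than a production converter.

```python
MAX_CODE_POINT = 0x10FFFF  # upper limit stated in the message


def _check(cp: int) -> None:
    """Reject values that are not encodable Unicode scalar values."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError(f"code point out of range: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"surrogate code point cannot be encoded: {cp:#x}")


def utf8_encode(cp: int) -> bytes:
    """Encode one code point as a 1, 2, 3, or 4-byte UTF-8 sequence."""
    _check(cp)
    if cp < 0x80:                      # 7 bits  -> 1 byte
        return bytes([cp])
    if cp < 0x800:                     # 11 bits -> 2 bytes
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 16 bits -> 3 bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 21 bits -> 4 bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_units(cp: int) -> list[int]:
    """Return the 16-bit code units for one code point (1 unit or a surrogate pair)."""
    _check(cp)
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000                   # 20 bits remain
    return [0xD800 | (v >> 10),        # high (lead) surrogate: top 10 bits
            0xDC00 | (v & 0x3FF)]      # low (trail) surrogate: bottom 10 bits


if __name__ == "__main__":
    cp = 0x1D11E  # a character above U+FFFF
    print(utf8_encode(cp).hex(" "))
    print([f"{u:04X}" for u in utf16_units(cp)])
```

Code that stores text in 16-bit units must keep the two values returned by `utf16_units` together, which is the "don't split surrogates" rule mentioned above.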
Received on Wednesday, 27 June 2001 21:16:36 UTC