Re: Surrogates ?

Mike French scripsit:
> 
> Does Unicode 3.1 require surrogates ?


Yes, or more accurately, it assigns characters above U+FFFF, and UTF-16 can
represent those only as surrogate pairs.


> I know that surrogate encoding schemes have existed for a while (ab initio),
> and technically all Unicode processors should support them,
> but AFAIK there were no actual blocks assigned in the full UCS-4 domain until 3.x.

Until 3.1.

> If supporting 3.1 implicitly requires that UTF-16 and UTF-8 processing supports 
> surrogates, because it has character blocks defined for the full UCS-4 domain,

Not the full UCS-4 domain, only up to U+10FFFF.  Code points beyond that will
never be assigned to characters.

> then this will break a lot of character handling implementations in the real world.
> For example, I bet quite a few UTF-8 converters only handle 1, 2 or 3-byte sequences
> (enough to hold 16-bit data), not the full 6(?) needed for surrogates.

UTF-8 converters should now handle 1, 2, 3, and 4-byte sequences.  There is no
need for 5-byte or 6-byte sequences.
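
To make the byte counts concrete, here is a rough Python sketch of a UTF-8
encoder for a single code point.  The name utf8_encode is made up for this
illustration, and rejecting lone surrogate code points is an assumption of
the sketch, not something stated above.  The only point is that the 4-byte
form already reaches U+10FFFF.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point (0..0x10FFFF) as UTF-8, using 1 to 4 bytes."""
    # Assumption of this sketch: surrogate code points are not encodable.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:                       # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (covers up to U+10FFFF)
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

A converter that stops at the 3-byte branch handles everything up to U+FFFF
and nothing beyond it; the last branch is all that is missing.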

Warnings about this have been going out since Unicode 2.0.

> And I also know that most Unicode implementations use unsigned short 16-bit
> integers to hold character data, not full 32-bit integers.

Then they can be UTF-16 implementations, as long as they obey the UTF-16
rules (don't split surrogate pairs).
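
As an illustration of what that means for code working on 16-bit units,
here is a small Python sketch.  The function names (to_surrogates,
from_surrogates, safe_cut) are invented for this example; the arithmetic is
the standard UTF-16 mapping between a code point above U+FFFF and its pair
of surrogate code units.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Map a code point in U+10000..U+10FFFF to (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError(f"no surrogate pair for {cp:#x}")
    v = cp - 0x10000                    # 20 bits remain
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def from_surrogates(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def safe_cut(units: list[int], i: int) -> int:
    """Return a cut index <= i that does not fall inside a surrogate pair."""
    if (0 < i < len(units)
            and 0xD800 <= units[i - 1] <= 0xDBFF
            and 0xDC00 <= units[i] <= 0xDFFF):
        return i - 1                    # back up to before the high surrogate
    return i
```

An implementation that stores unsigned 16-bit units stays correct if every
place that truncates, splits, or indexes a string does something like
safe_cut first.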

> Anything that hastens the day when surrogates appear in XML,
> either explicitly or implicitly, is a very bad idea !

These characters are already allowed in XML 1.0 character content, and have
been since 1998.  Blueberry is about allowing them in names.
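
For reference, the Char production in XML 1.0 is
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF].
A minimal Python sketch of that check follows; is_xml10_char is a name
made up for this example.

```python
def is_xml10_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)   # characters above U+FFFF
```

The last range is the one that admits characters above U+FFFF in content.
Individual surrogate code points (U+D800 to U+DFFF) are excluded; the
characters themselves are what is allowed.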

> P.S. Your Unicode link points to XPointer !
>      Is this a circular meta-reference  ???

It's a bug.

-- 
John Cowan                                   cowan@ccil.org
One art/there is/no less/no more/All things/to do/with sparks/galore
	--Douglas Hofstadter

Received on Wednesday, 27 June 2001 21:16:36 UTC