Re: surrogates for XML

Hello Yves,

It's not that difficult, but indeed you have to be careful.

XML is described in terms of characters. A single surrogate
code point is definitely not a character, so it's excluded from
production [2] (Char). On the other hand, there will be characters
allocated in planes 1, 2, ..., and XML is prepared for this:
that is what the range #x10000-#x10FFFF in the production covers.
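
To make this concrete, here is a minimal sketch of the Char
production (quoted below) as a test on code points. The function
name is_xml_char is just an illustration, not something taken from
the XML specification or from any particular parser.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production.

    Hypothetical helper for illustration only.
    """
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # BMP up to just before the surrogates
        or 0xE000 <= cp <= 0xFFFD      # BMP after the surrogates
        or 0x10000 <= cp <= 0x10FFFF   # planes 1 to 16
    )


# A lone surrogate code point is rejected; a plane 1 code point is accepted.
print(is_xml_char(0xD800))    # False
print(is_xml_char(0x10000))   # True
```

The surrogate range #xD800-#xDFFF falls in the gap between the
second and third ranges, which is why no single surrogate ever
matches.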

Consider UTF-8 and UTF-16, the two encodings every XML
processor is required to understand. In UTF-8, a character in,
say, plane 1 will be encoded as a sequence of four bytes (the
first one of the form 11110xxx). In UTF-16, the same character
will be encoded as a high surrogate followed by a low
surrogate. But this is just the UTF-16-specific way of
encoding, not relevant for XML itself.
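
As a rough sketch of the two encodings side by side, the following
takes one arbitrarily chosen plane 1 code point (U+10400) and shows
its UTF-8 and UTF-16 forms. The helper names to_surrogate_pair and
from_surrogate_pair are my own; the arithmetic is the usual
surrogate-pair conversion, written out here as an illustration
rather than copied from the Unicode text.

```python
def to_surrogate_pair(cp: int):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a code point outside the BMP")
    offset = cp - 0x10000                 # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into a scalar value."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


cp = 0x10400                              # arbitrary example in plane 1
utf8 = chr(cp).encode("utf-8")
utf16 = chr(cp).encode("utf-16-be")

print(utf8.hex(" "))                      # f0 90 90 80
print(format(utf8[0], "08b"))             # 11110000, i.e. 11110xxx
print(utf16.hex(" "))                     # d8 01 dc 00
print([hex(u) for u in to_surrogate_pair(cp)])  # ['0xd801', '0xdc00']
```

In both cases the XML processor ends up with the single character
U+10400; the surrogates exist only inside the UTF-16 byte stream.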

Hope this helps.    Regards,   Martin.

At 00/10/09 15:15 +0900, Yves wrote:

>I have a question about Unicode surrogates and XML:
>
>The XML specifications define the range of valid characters to be:
>
>Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
>
>This explicitly excludes the surrogate blocks. But the scalar values 0x10000 
>to 0x10FFFF seem to indicate that surrogates are supported... I'm not 
>sure I understand. In addition, Unicode version 3.0 also gives 
>formulas to go back and forth between surrogate pairs and scalar values, 
>mentioning their need for XML (section 3.7).
>
>I would appreciate it a lot if someone could give me more 
>information on how surrogates are supported in XML.
>
>Thanks.
>
>-yves savourel

Received on Monday, 9 October 2000 08:37:50 UTC