Re: surrogates for XML from Martin J. Duerst on 2000-10-09 (www-international@w3.org from October to December 2000)

From: Martin J. Duerst <duerst@w3.org>
Date: Mon, 09 Oct 2000 21:38:52 +0900
To: "Yves" <yves@opentag.com> (by way of "Martin J. Duerst" <duerst@w3.org>), www-international@w3.org
Message-Id: <4.2.0.58.J.20001009213848.00c51710@sh.w3.mag.keio.ac.jp>

Hello Yves,

It's not that difficult, but indeed you have to be careful.

XML is defined in terms of characters. A single surrogate
code point is not a character, so it is excluded from
production [2] (Char). On the other hand, characters will be
allocated in planes 1, 2, and so on, and XML is prepared for
this: the range #x10000-#x10FFFF in that production covers them.
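
To make the distinction concrete, here is a minimal sketch in
Python of a check against the Char production quoted below. The
function name is_xml_char is only an illustration, not part of
any XML API; the ranges are copied from the production.

```python
def is_xml_char(cp):
    """Return True if the code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

# A lone surrogate code point is rejected ...
print(is_xml_char(0xD800))   # False
# ... but a plane 1 character is accepted.
print(is_xml_char(0x10000))  # True
```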

Consider UTF-8 and UTF-16, the two encodings every XML
processor is required to understand. In UTF-8, a character in
plane 1, for example, will be encoded as a sequence of four
bytes (the first one of the form 11110xxx). In UTF-16, the same
character will be encoded as a high surrogate followed by a low
surrogate. But this is just the UTF-16-specific way of
encoding, not relevant for XML itself.
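
As a rough illustration, the following Python sketch encodes
U+10000 (the first code point of plane 1) both ways. The helper
names to_surrogates and from_surrogates are made up for this
example; the arithmetic is the standard UTF-16 mapping between
scalar values and surrogate pairs.

```python
def to_surrogates(cp):
    """Split a scalar value in 0x10000..0x10FFFF into (high, low) surrogates."""
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def from_surrogates(high, low):
    """Combine a high and a low surrogate back into one scalar value."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

ch = "\U00010000"                # one character, scalar value 0x10000
utf8 = ch.encode("utf-8")        # four bytes: F0 90 80 80
utf16 = ch.encode("utf-16-be")   # two 16-bit units: D800 DC00

print(utf8.hex(), format(utf8[0], "08b"))  # first byte has the form 11110xxx
print(utf16.hex())
print([hex(u) for u in to_surrogates(ord(ch))])
```

In both cases the XML processor sees the same single character;
the surrogates exist only inside the UTF-16 byte stream.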

Hope this helps.    Regards,   Martin.

At 00/10/09 15:15 +0900, Yves wrote:

>I have a question about Unicode surrogates and XML:
>
>The XML specifications define the range of valid characters to be:
>
>Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | 
>[#x10000-#x10FFFF]
>
>This explicitly excludes the surrogate blocks. But the scalar values 0x10000
>to 0x10FFFF seem to indicate that surrogates are supported... I'm not
>sure I understand. In addition, Unicode version 3.0 also gives
>formulas to go back and forth between surrogate pairs and scalar values,
>mentioning their need for XML (section 3.7).
>
>I would appreciate it a lot if someone could give me more
>information on how surrogates are supported in XML.
>
>Thanks.
>
>-yves savourel

Received on Monday, 9 October 2000 08:37:50 UTC