- From: Martin J. Duerst <duerst@w3.org>
- Date: Mon, 09 Oct 2000 21:38:52 +0900
- To: "Yves" <yves@opentag.com> (by way of "Martin J. Duerst" <duerst@w3.org>), www-international@w3.org
Hello Yves,

It's not that difficult, but you do have to be careful. XML is described in terms of characters. A single surrogate code point is definitely not a character, so it is excluded from production [2]. On the other hand, characters will be allocated in planes 1, 2, ..., and XML is prepared for this.

Consider UTF-8 and UTF-16, the two encodings every XML processor is required to understand. In UTF-8, a character in, e.g., plane 1 is encoded as a sequence of four bytes (the first one of the form 11110xxx). In UTF-16, the same character is encoded as a high surrogate followed by a low surrogate. But this is just the UTF-16-specific way of encoding, and is not relevant to XML itself.

Hope this helps.

Regards, Martin.

At 00/10/09 15:15 +0900, Yves wrote:
>I have a question about Unicode surrogates and XML:
>
>The XML specification defines the range of valid characters to be:
>
>Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
>[#x10000-#x10FFFF]
>
>This explicitly excludes the surrogate blocks. But the scalar values 0x10000
>to 0x10FFFF seem to indicate that surrogates are supported... I'm not
>sure I understand. In addition, Unicode version 3.0 also gives
>formulas to go back and forth between surrogate pairs and scalar values,
>mentioning their need for XML (section 3.7).
>
>I would appreciate it a lot if someone could give me more
>information on how surrogates are supported in XML.
>
>Thanks.
>
>-yves savourel
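
To make the distinction in the reply above concrete, here is a small Python sketch. It is not part of the original exchange: the function names are made up for illustration, the arithmetic is the standard UTF-16 surrogate mapping, and U+10400 is just an arbitrary plane 1 code point chosen as an example.

```python
# Illustrative sketch only; all function names here are hypothetical.

def is_xml_char(cp):
    """True if the scalar value cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def to_surrogates(cp):
    """Split a scalar value in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane scalar value")
    offset = cp - 0x10000              # 20-bit offset above the BMP
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogates(high, low):
    """Combine a high and a low surrogate back into one scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

cp = 0x10400                           # example code point in plane 1
high, low = to_surrogates(cp)

print(is_xml_char(cp))                 # the scalar value itself is a Char
print(is_xml_char(high), is_xml_char(low))  # lone surrogates are not
print(chr(cp).encode("utf-8").hex())        # four bytes, lead byte 11110xxx
print(chr(cp).encode("utf-16-be").hex())    # high surrogate, then low surrogate
```

The point of the sketch is that the surrogate pair appears only in the UTF-16 byte stream. At the level of the Char production there is a single character, U+10400, whichever encoding carried it.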
Received on Monday, 9 October 2000 08:37:50 UTC