- From: Reece Dunn <msclrhd@googlemail.com>
- Date: Fri, 15 Mar 2024 13:30:35 +0000
- To: Norm Tovey-Walsh <norm@saxonica.com>
- Cc: Dimitre Novatchev <dnovatchev@gmail.com>, Christian Grün <cg@basex.org>, public-xslt-40@w3.org
- Message-ID: <CAGdtn26iN-mxWm44hWdeEiVxrBCkSBSg3LWJ67GLGTOK_7uO4A@mail.gmail.com>
On Fri, 15 Mar 2024 at 12:42, Norm Tovey-Walsh <norm@saxonica.com> wrote:

> Dimitre Novatchev <dnovatchev@gmail.com> writes:
> > Seems to work like a charm 😀
>
> Doesn’t blindly subtracting the code point for 0x110000 run the risk of
> producing a non-Unicode character? I think the original code point would
> have to be in…checks notes…plane 16, so fairly unlikely, but still…
>

Yes. There are several possibilities w.r.t. possibly invalid or unassigned Unicode characters:

1. noncharacters: U+nFFFE, U+nFFFF, U+FDD0..U+FDEF [1] [2] -- valid, but reserved for internal use;
2. control characters: U+0000..U+001F (C0), U+007F, U+0080..U+009F (C1) [2] -- only the HT, LF, and CR whitespace characters from C0 are allowed in XML (U+0009, U+000A, U+000D), while DEL (U+007F) and the C1 control characters are discouraged;
3. UTF-16 surrogates: U+D800..U+DFFF [3] -- these are used in pairs by UTF-16 to encode code points above U+FFFF and are invalid in any other context;
4. unassigned code points -- these are Unicode code points that are valid, but haven't been allocated a character yet;
5. combining characters -- these are characters such as those in the U+0300..U+036F range [4] that can be combined with other characters to add umlauts, graves, etc. to the base characters.

There will be other complexities with just subtracting the maximum code point. It would be better to have something like an array of the characters in the alphabet, in order, the inverse of which would be easily computed.

[1] http://www.unicode.org/versions/corrigendum9.html
[2] https://www.w3.org/TR/REC-xml/#NT-Char
[3] https://www.unicode.org/faq/utf_bom.html#utf16-2
[4] https://www.unicode.org/charts/PDF/U0300.pdf

> Be seeing you,
> norm
>
> --
> Norm Tovey-Walsh
> Saxonica
>
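
As a rough illustration of the five categories above and of the ordered-alphabet idea, here is a minimal Python sketch. The names classify and invert_alphabet are hypothetical, not from any proposal in this thread. Reading "inverse" as "map each character to its mirror position in the alphabet" is an assumption. The "unassigned" result depends on the Unicode version bundled with the Python runtime.

```python
import unicodedata


def classify(cp: int) -> str:
    """Roughly classify a code point (hypothetical helper for illustration)."""
    if not 0 <= cp <= 0x10FFFF:
        return "out of range"
    # UTF-16 surrogates are never valid as standalone characters.
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    # HT, LF and CR are the only C0 controls that XML allows.
    if cp in (0x09, 0x0A, 0x0D):
        return "ok"
    # Remaining C0 controls, DEL, and the C1 controls.
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:
        return "control"
    category = unicodedata.category(chr(cp))
    # "Cn" means not assigned in the runtime's Unicode database.
    if category == "Cn":
        return "unassigned"
    # Mn, Mc and Me are the combining mark categories.
    if category.startswith("M"):
        return "combining"
    return "ok"


def invert_alphabet(alphabet: list[str]) -> dict[str, str]:
    """Map each character of an ordered alphabet to its mirrored position.

    Assumed meaning of "inverse": the first character maps to the last,
    the second to the second-to-last, and so on.
    """
    return dict(zip(alphabet, reversed(alphabet)))


if __name__ == "__main__":
    for cp in (0x41, 0xFFFE, 0xD800, 0x0085, 0x0301):
        print(f"U+{cp:04X}: {classify(cp)}")
    print(invert_alphabet(list("abc")))
```

Because the alphabet is an explicit list, every value the mapping produces is a character that was put there deliberately, so none of the problem categories can appear by accident.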
Received on Friday, 15 March 2024 13:30:52 UTC