Re: String complements

On Fri, 15 Mar 2024 at 12:42, Norm Tovey-Walsh <norm@saxonica.com> wrote:

> Dimitre Novatchev <dnovatchev@gmail.com> writes:
> > Seems to work like a charm 😀
>
> Doesn’t blindly subtracting the code point for 0x110000 run the risk of
> producing a non-Unicode character? I think the original code point would
> have to be in…checks notes…plane 16, so fairly unlikely, but still…
>

Yes. There are several classes of potentially invalid or unassigned
Unicode code points to consider:
1. noncharacters: U+nFFFE, U+nFFFF, U+FDD0..U+FDEF [1] [2] -- valid, but
reserved for internal use;
2. control characters: U+0000..U+001F (C0), U+007F, U+0080..U+009F (C1) [2]
-- of the C0 set, only the whitespace characters HT, LF, and CR (U+0009,
U+000A, U+000D) are allowed in XML, while DEL (U+007F) and the C1 control
characters are discouraged;
3. UTF-16 surrogates: U+D800..U+DFFF [3] -- these are used in pairs to
encode code points from U+10000 upward in UTF-16 and are invalid in any
other context;
4. unassigned codepoints -- these are Unicode codepoints that are valid,
but haven't been allocated a character yet;
5. combining characters -- these are characters such as those in the
U+0300..U+036F range [4] that can be combined with other characters to add
umlauts, graves, etc. to the base characters.
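
To make the categories above concrete, here is a rough sketch in Python
(chosen only for illustration; it is not code from this thread). The
function name `classify` and its labels are my own invention, and the
"unassigned" and "combining" answers depend on the Unicode version bundled
with the Python interpreter, so treat the result as approximate.

```python
import unicodedata


def classify(cp: int) -> str:
    """Return a rough label for code point cp (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        return "out of range"
    # Surrogates: only meaningful as halves of a UTF-16 pair.
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points
    # (U+nFFFE, U+nFFFF) of every plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    # C0 controls, DEL, and C1 controls.
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:
        return "control"
    category = unicodedata.category(chr(cp))
    # "Cn" is the general category reported for unassigned code points.
    if category == "Cn":
        return "unassigned"
    # Mn, Mc, and Me are the mark (combining) categories.
    if category.startswith("M"):
        return "combining"
    return "other"
```

A complement computed by plain subtraction could be passed through a check
like this to see which of the cases it lands in.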

There will be other complexities with just subtracting the maximum
codepoint.

It would be better to have something like an array of characters in the
alphabet in order, the inverse of which would be easily computed.
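
A minimal sketch of that idea, again in Python for illustration only. It
assumes "inverse" means mapping the character at position i of the alphabet
to the one at position n-1-i; the names `ALPHABET` and `complement` are
hypothetical, and the lowercase Latin alphabet is just an example.

```python
# Example alphabet, listed in the desired sort order (an assumption).
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def complement(s: str, alphabet: str = ALPHABET) -> str:
    """Map each character of s to its mirror position in alphabet.

    Raises KeyError if s contains a character that is not in alphabet.
    """
    last = len(alphabet) - 1
    position = {ch: i for i, ch in enumerate(alphabet)}
    return "".join(alphabet[last - position[ch]] for ch in s)
```

Because every result is drawn from the alphabet itself, the complement can
never be a surrogate, a noncharacter, or an unassigned codepoint, and
applying it twice returns the original string.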

[1] http://www.unicode.org/versions/corrigendum9.html
[2] https://www.w3.org/TR/REC-xml/#NT-Char
[3] https://www.unicode.org/faq/utf_bom.html#utf16-2
[4] https://www.unicode.org/charts/PDF/U0300.pdf


>                                         Be seeing you,
>                                           norm
>
> --
> Norm Tovey-Walsh
> Saxonica
>

Received on Friday, 15 March 2024 13:30:52 UTC