Re: String complements from Christian Grün on 2024-03-14 (public-xslt-40@w3.org from March 2024)

From: Christian Grün <cg@basex.org>
Date: Thu, 14 Mar 2024 17:29:01 +0000
To: Dimitre Novatchev <dnovatchev@gmail.com>, Norm Tovey-Walsh <norm@saxonica.com>
CC: "public-xslt-40@w3.org" <public-xslt-40@w3.org>
Message-ID: <AS4PR09MB5549131F8D0BAA0EF860E64DC7292@AS4PR09MB5549.eurprd09.prod.outlook.com>

I see; so in a nutshell it’s this?

sort(
  $input,
  keys := fn { string-to-codepoints(.) ! (0x110000 - .), 0x110000 }
)
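
For readers who want to try the idea outside XQuery, here is a minimal Python sketch of the same key construction. It assumes plain codepoint order rather than a real collation, and the name `complement_key` is hypothetical. Each codepoint c becomes 0x110000 - c, and 0x110000 is appended as a terminator, so an ordinary ascending sort on the keys should give descending string order.

```python
# Illustrative sketch only: assumes codepoint ordering, not a real collation.
LIMIT = 0x110000  # one past the largest Unicode codepoint (0x10FFFF)

def complement_key(s):
    """Map each codepoint c to LIMIT - c, then append LIMIT as a terminator."""
    return tuple(LIMIT - ord(ch) for ch in s) + (LIMIT,)

words = ["Z", "A", "ZZ", ""]
# Ascending sort on the complemented keys; tuples compare lexicographically.
print(sorted(words, key=complement_key))  # ['ZZ', 'Z', 'A', '']
```

The terminator is what places "ZZ" before "Z": without it, the key of "Z" would be a proper prefix of the key of "ZZ" and would sort first.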
_________________________________

Excellent question.

And yes, I gave an improper mapping, the correct one is:

"" :             '$' ,
S1 : Sn ,
S2 : Sn-1 ,
.  .  .  .  .  .  .
Sk : Sn-k+1 ,
.  .  .  .  .  .  .

Sn : S1

And, very importantly, every mapped string must have the '$' character appended: just one trailing '$' character per string, not one per symbol.

In this way we will have:

Z   => "A"||"$",
ZZ => "AA"||"$"

"A$" > "AA$"   because $ is the biggest symbol and is > "A"

Thus "AA$" (the value of the inversion of "ZZ") must be returned before "A$" (the value of the inversion of "Z")
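
The corrected mapping can be checked with a small Python sketch. It assumes a toy collation whose only symbols are A..Z in that order; the names `SYMBOLS`, `RANK`, `invert` and `collation_key` are hypothetical. Because the real '$' character sorts below "A" by codepoint, the sketch compares strings by symbol rank so that '$' really is the largest symbol.

```python
# Illustrative sketch only: a toy "collation" whose symbols are A..Z in order.
import string

SYMBOLS = string.ascii_uppercase   # S1..Sn, ordered by their collation value
TERMINATOR = "$"                   # stands for a symbol greater than every Sk
RANK = {ch: i for i, ch in enumerate(SYMBOLS)}
RANK[TERMINATOR] = len(SYMBOLS)    # '$' ranks above Sn

def invert(s):
    """Map Sk to Sn-k+1 and append exactly one terminator."""
    mapped = "".join(SYMBOLS[len(SYMBOLS) - 1 - RANK[ch]] for ch in s)
    return mapped + TERMINATOR

def collation_key(s):
    """Compare strings by symbol rank instead of by codepoint."""
    return [RANK[ch] for ch in s]

print(invert("Z"), invert("ZZ"))                                  # A$ AA$
print(collation_key(invert("Z")) > collation_key(invert("ZZ")))   # True
```

Sorting the original strings by `collation_key(invert(s))` in ascending order then lists "ZZ" before "Z", as described above.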

Thanks,
Dimitre

On 14.03.2024 at 17:06, Dimitre Novatchev <dnovatchev@gmail.com> wrote:
>    >    For a simplified example, revert("abc") would produce "zyx" . This is doable and really valuable.
>
>      In what sense is “zyx” the complement of “abc”? Over what set of codepoints and in what collation?
>
>      I am very skeptical that such a function is well defined across all collations and will always produce a single, correct result in all cases.
>
>      Can you provide a detailed description of how this would work?

Yes, as Michael Kay already explained, this is doable in either of two ways: rely on the "biggest" symbol in the collation being unused (which, by the way, happens in some collations; for example, the biggest symbol in the English (American) collation is 0xFE), or add an additional symbol that is "bigger" than any other symbol in the collation.

Let us, just for convenience, refer to this special symbol as '$' (this is just a convention on how to refer to this special symbol, not the actual dollar character).

Then, if S1, S2, ..., Sn are all n symbols in the collation, ordered by their value in the collation, perform this mapping:

"" :             '$' ,
S1 : Sn || '$' ,
S2 : Sn-1 ||  '$' ,
.  .  .  .  .  .  .
Sk : Sn-k+1 || '$',
.  .  .  .  .  .  .

Sn : S1 || '$'

Admittedly, adding a new symbol to a collation actually creates a new collation, and this may be the most straightforward way of inverting strings.

We may not even need to create a new collation: we could simply adopt a convention that a collation named "Inverted" || {Real-Collation-Name} produces the negation of the comparison results produced by the {Real-Collation-Name} collation. Or, as I mentioned before, this is the same as "decorating a collation".
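
As a rough illustration of "decorating a collation", the Python sketch below models a collation as a three-way comparison function and wraps it so that every result is negated. The names `codepoint_collation` and `inverted` are hypothetical, and codepoint order merely stands in for a real collation.

```python
# Illustrative sketch only: a "collation" is modelled as a three-way comparison.
from functools import cmp_to_key

def codepoint_collation(a, b):
    """Hypothetical base collation: returns -1, 0 or 1 by codepoint order."""
    return (a > b) - (a < b)

def inverted(collation):
    """Decorate a collation so that every comparison result is negated."""
    return lambda a, b: -collation(a, b)

words = ["abc", "zyx", "ab"]
# Sorting with the decorated collation gives descending order.
print(sorted(words, key=cmp_to_key(inverted(codepoint_collation))))  # ['zyx', 'abc', 'ab']
```

In this approach no string is ever rewritten, so no extra terminator symbol is needed; only the comparison result changes.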

This is one more way to get rid of the $orders parameter in our current functions.

Thanks,
Dimitre

On Thu, Mar 14, 2024 at 3:24 AM Norm Tovey-Walsh <norm@saxonica.com> wrote:
Dimitre Novatchev <dnovatchev@gmail.com> writes:
>    This function can easily handle strings - produce a "string complement" in the value space for a particular collation.
>
>    For a simplified example, revert("abc") would produce "zyx" . This is doable and really valuable.

In what sense is “zyx” the complement of “abc”? Over what set of codepoints and in what collation?

I am very skeptical that such a function is well defined across all collations and will always produce a single, correct result in all cases.

Can you provide a detailed description of how this would work?

                                        Be seeing you,
                                          norm

--
Norm Tovey-Walsh
Saxonica

Received on Thursday, 14 March 2024 17:29:09 UTC