Re: String complements from Dimitre Novatchev on 2024-03-15 (public-xslt-40@w3.org from March 2024)

From: Dimitre Novatchev <dnovatchev@gmail.com>
Date: Thu, 14 Mar 2024 17:01:40 -0700
To: Christian Grün <cg@basex.org>
Cc: Norm Tovey-Walsh <norm@saxonica.com>, "public-xslt-40@w3.org" <public-xslt-40@w3.org>
Message-ID: <CAK4KnZeHV5FQc2UnXq_wuyQ4a0QZ5ZwfzXW-2eYJJQafV4nPmw@mail.gmail.com>
Seems to work like a charm 😀

I took 11 Swedish words in their sorted order from an online Swedish
dictionary (https://en.bab.la/dictionary/swedish-english/s/1) and applied
the sorting expression.

The result is: all the words in the opposite order -- really sweet.
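
For anyone who wants to try the idea without an XPath 4.0 processor, here
is a minimal Python sketch of the key function from the sorting expression
quoted below. It is only an illustration under stated assumptions: the name
complement_key and the sample words are hypothetical (they are not the
words from my test), plain codepoint order stands in for a real Swedish
collation, and the strings are assumed not to contain U+0000 (which XML
strings cannot contain anyway).

```python
TERMINATOR = 0x110000  # 0x10FFFF + 1, one past the largest Unicode codepoint


def complement_key(s):
    """Replace each codepoint cp by 0x110000 - cp, then append 0x110000."""
    return tuple(TERMINATOR - ord(ch) for ch in s) + (TERMINATOR,)


# Hypothetical sample words, listed in ascending codepoint order.
words = ["sak", "saker", "sal", "salt", "sand"]

result = sorted(words, key=complement_key)
print(result)  # ['sand', 'salt', 'sal', 'saker', 'sak']
```

An ordinary ascending sort on the complemented keys returns the words in
descending order, which is the effect I observed.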

[image: image.png]


Thanks,
Dimitre

On Thu, Mar 14, 2024 at 11:50 AM Dimitre Novatchev <dnovatchev@gmail.com>
wrote:

> > I see; so in a nutshell it’s this?
> >
> >   sort(
> >     $input,
> >     keys := fn { string-to-codepoints(.) ! (0x110000 - .), 0x110000 }
> >   )
>
> Essentially yes. This works for a binary collation that, for each
> codepoint cp it contains, also contains the codepoint 0x110000 - cp.
>
> The way to make this work with any collation is not to do the subtraction,
> but to swap every pair of codepoints that are equally far from the median
> of the collation. So read step 1 below with this note in mind.
>
> 0x10FFFF is the largest Unicode codepoint (see
> https://unicodebook.readthedocs.io/unicode_encodings.html#utf-8).
>
> Add one to it, and we get the value 0x110000.
>
> 1. Subtract the codepoint of each character in the string from this value;
> the resulting codepoints form a new string.
>
> 2. Append this same value to the end of the string so produced.
>
> One could ask: "Why should we add a new symbol (codepoint)?"
>
> Probably for the same reason that we use the empty string "", which
> contains no codepoints at all (0 codepoints): for convenience and
> completeness.
>
> Thanks,
> Dimitre
>
> On Thu, Mar 14, 2024 at 10:29 AM Christian Grün <cg@basex.org> wrote:
>
>> I see; so in a nutshell it’s this?
>>
>> sort(
>>   $input,
>>   keys := fn { string-to-codepoints(.) ! (0x110000 - .), 0x110000 }
>> )
>> _________________________________
>>
>> Excellent question.
>>
>> And yes, I gave an improper mapping; the correct one is:
>>
>> ""  :  '$',
>> S1  :  Sn,
>> S2  :  Sn-1,
>> .  .  .  .  .  .  .
>> Sk  :  Sn-k+1,
>> .  .  .  .  .  .  .
>> Sn  :  S1
>>
>> and, very importantly, every mapped string must have the '$' character
>> appended to it. Just one trailing '$' character.
>>
>> In this way we will have:
>>
>> Z   => "A"||"$",
>> ZZ => "AA"||"$"
>>
>> "A$" > "AA$"   because $ is the biggest symbol and is > "A"
>>
>> Thus "AA$" (the value of the inversion of "ZZ") must be returned before
>> "A$" (the value of the inversion of "Z")
>>
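
The Z / ZZ case can be checked mechanically. Below is a minimal Python
sketch, assuming a hypothetical 26-symbol collation A < B < ... < Z and
representing each symbol by its rank, with the special '$' symbol given a
rank above all of them. The names invert, RANK and DOLLAR are made up for
this illustration.

```python
import string

ALPHABET = string.ascii_uppercase  # hypothetical collation: A < B < ... < Z
N = len(ALPHABET)
RANK = {symbol: k for k, symbol in enumerate(ALPHABET)}  # 0-based ranks
DOLLAR = N  # rank of the special '$' symbol, above every real symbol


def invert(s, terminate=True):
    """Map S_k to S_(n-k+1) and, if requested, append one '$'."""
    mapped = tuple(N - 1 - RANK[symbol] for symbol in s)
    return mapped + (DOLLAR,) if terminate else mapped


print(invert("Z"))   # (0, 26), i.e. "A$"
print(invert("ZZ"))  # (0, 0, 26), i.e. "AA$"
print(sorted(["Z", "ZZ"], key=invert))  # ['ZZ', 'Z']
print(sorted(["Z", "ZZ"], key=lambda s: invert(s, terminate=False)))  # ['Z', 'ZZ']
```

Without the trailing '$', "A" is a prefix of "AA" and therefore sorts
first, so "Z" would still come before "ZZ"; the single terminator is what
reverses the order of a string and its extensions.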
>>
>> Thanks,
>> Dimitre
>>
>>
>>
>> On 14.03.2024 at 17:06, Dimitre Novatchev <dnovatchev@gmail.com> wrote:
>> >    >    For a simplified example, revert("abc") would produce "zyx".
>> >    >    This is doable and really valuable.
>> >
>> >      In what sense is “zyx” the complement of “abc”? Over what set of
>> >      codepoints and in what collation?
>> >
>> >      I am very skeptical that such a function is well defined across
>> >      all collations and will always produce a single, correct result in
>> >      all cases.
>> >
>> >      Can you provide a detailed description of how this would work?
>>
>> Yes, as Michael Kay already explained, this is doable in either of two
>> ways: rely on the "biggest" symbol in the collation being unused (which,
>> by the way, is the case in some collations; for example, the biggest
>> symbol in the English (American) collation is 0xFE), or add an additional
>> symbol that is "bigger" than any other symbol in the collation.
>>
>> Let us, just for convenience, refer to this special symbol as '$' (this
>> is just a convention on how to refer to this special symbol, not the actual
>> dollar character).
>>
>> Then, if S1, S2, ..., Sn are all n symbols in the collation, ordered by
>> their value in the collation, perform this mapping:
>>
>> ""  :  '$',
>> S1  :  Sn || '$',
>> S2  :  Sn-1 || '$',
>> .  .  .  .  .  .  .
>> Sk  :  Sn-k+1 || '$',
>> .  .  .  .  .  .  .
>> Sn  :  S1 || '$'
>>
>> Certainly, adding a new symbol to a collation actually creates a new
>> collation, and this may be the most straightforward way of inverting
>> strings.
>>
>> We need not even create a new collation: we could simply adopt the
>> convention that a collation named "Inverted" || {Real-Collation-Name}
>> produces the negation of the comparison results produced by the
>> {Real-Collation-Name} collation. As I mentioned before, this is the same
>> as "decorating a collation".
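
To make the "decorating" idea concrete, here is a minimal Python sketch. It
assumes a collation can be modelled as a three-way comparison function; the
names codepoint_compare and inverted are hypothetical, and plain codepoint
comparison stands in for {Real-Collation-Name}.

```python
from functools import cmp_to_key


def codepoint_compare(a, b):
    """Stand-in collation: -1, 0 or 1 by codepoint order."""
    return (a > b) - (a < b)


def inverted(compare):
    """Decorate a collation so that every comparison result is negated."""
    def negated(a, b):
        return -compare(a, b)
    return negated


words = ["alpha", "beta", "gamma"]  # hypothetical sample data
descending = sorted(words, key=cmp_to_key(inverted(codepoint_compare)))
print(descending)  # ['gamma', 'beta', 'alpha']
```

No string is rewritten here; only the comparison result changes sign, so
the approach does not depend on which symbols the underlying collation
contains.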
>>
>> This is one more way to get rid of the $orders parameter in our current
>> functions.
>>
>> Thanks,
>> Dimitre
>>
>> On Thu, Mar 14, 2024 at 3:24 AM Norm Tovey-Walsh <norm@saxonica.com> wrote:
>> Dimitre Novatchev <dnovatchev@gmail.com> writes:
>> >    This function can easily handle strings - produce a "string
>> >    complement" in the value space for a particular collation.
>> >
>> >    For a simplified example, revert("abc") would produce "zyx". This
>> >    is doable and really valuable.
>>
>> In what sense is “zyx” the complement of “abc”? Over what set of
>> codepoints and in what collation?
>>
>> I am very skeptical that such a function is well defined across all
>> collations and will always produce a single, correct result in all cases.
>>
>> Can you provide a detailed description of how this would work?
>>
>>                                         Be seeing you,
>>                                           norm
>>
>> --
>> Norm Tovey-Walsh
>> Saxonica
>>
Received on Friday, 15 March 2024 00:01:57 UTC