Re: String complements from Dimitre Novatchev on 2024-03-15 (public-xslt-40@w3.org from March 2024)

From: Dimitre Novatchev <dnovatchev@gmail.com>
Date: Fri, 15 Mar 2024 08:47:17 -0700
To: Christian Grün <cg@basex.org>
Cc: Norm Tovey-Walsh <norm@saxonica.com>, "public-xslt-40@w3.org" <public-xslt-40@w3.org>
Message-ID: <CAK4KnZcHyH9AC=0M5O3A+CAKaWCNLkx=xbh2s_QXe_3rF=XcRA@mail.gmail.com>
> If we compare strings as a plain sequence of (codepoint) integers, it
> shouldn’t matter whether the subtracted value is legal Unicode.
> That’s no option, however, as soon as collations come into play.

Yes, and this is what I mentioned as a Note in an earlier message.

It is quite straightforward, given the sorted characters of a collation, to
map a single character to its mirror image (its opposite) about the median
of the collation.

I have a question for Michael Kay, Christian Gruen and other implementors:

Where can one find a list of all collation names and get the full sorted
list of characters for each collation?

I haven't seen such a list of collation names in the documentation for
either Saxon or BaseX.

I have a full list of collation names used in SQL Server -
https://learn.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-ver16

but these are not the names used in Saxon. Also, the collation names used
by SQL Server contain standard substrings that define: Case
Sensitivity (_CS), Accent Sensitivity (_AS), Kana Sensitivity (_KS), Width
Sensitivity (_WS), Variation Selector Sensitivity (_VSS), Binary (_BIN),
Binary-code point (_BIN2) and UTF-8 (_UTF8).

I have only seen that Saxon uses this collation name for Swedish:
"http://www.w3.org/2013/collation/UCA?lang=se", but this gives us absolutely
no hint about the case sensitivity of the collation, or any other
sensitivity. Compare with the ones from SQL Server:


   - *SQL_SwedishPhone_Pref_CP1_CI_AS* - Finnish-Swedish, case-insensitive,
   accent-sensitive, kanatype-insensitive, width-insensitive for Unicode Data,
   SQL Server Sort Order 184 on Code Page 1252 for non-Unicode Data
   - *SQL_SwedishStd_Pref_CP1_CI_AS* - Finnish-Swedish, case-insensitive,
   accent-sensitive, kanatype-insensitive, width-insensitive for Unicode Data,
   SQL Server Sort Order 185 on Code Page 1252 for non-Unicode Data
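
To illustrate how much information such a name carries, here is a rough
Python sketch that reads the sensitivity flags back out of a SQL Server
style collation name. The function name describe_collation is hypothetical,
and the sketch assumes that a missing _KS or _WS suffix means "insensitive",
as in the two descriptions above. It does not try to interpret binary
collations beyond flagging them.

```python
def describe_collation(name: str) -> dict:
    """Decode the sensitivity suffixes of a SQL Server style collation name.

    Hypothetical helper for illustration only; it looks at the
    underscore-separated tokens of the name and nothing else.
    """
    tokens = name.split("_")
    return {
        "case_sensitive": "CS" in tokens,
        "accent_sensitive": "AS" in tokens,
        "kana_sensitive": "KS" in tokens,    # absent suffix: insensitive
        "width_sensitive": "WS" in tokens,   # absent suffix: insensitive
        "binary": "BIN" in tokens or "BIN2" in tokens,
    }


flags = describe_collation("SQL_SwedishStd_Pref_CP1_CI_AS")
print(flags)
```

Nothing comparable can be read off the bare "UCA?lang=se" name quoted above.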

However, the only collation name for Swedish that I could find being used
in an example of the F&O document was:

"http://www.w3.org/2013/collation/UCA?lang=se"



For SQL Server one can immediately get all data available about all
collations using just a simple query:

  SELECT name, description FROM sys.fn_helpcollations()
  ORDER BY name;


This returns 3955 rows, one row per collation.

Do we have a tool like this in our XPath processors?

If we know the collation name and have standard functions such as:


   - *fn:collation-characters($collation-name as xs:string) as
   array(xs:integer)* - returns the codepoints of all characters
   contained in the collation named $collation-name, in the sorted
   order (defined by the collation) of the characters.
   - *fn:index-in-collation($collation-name as xs:string, $character as
   xs:string) as xs:integer* - returns the index of
   string-to-codepoints($character) in the array produced by
   fn:collation-characters($collation-name).

Then it is a one-liner to invert any string of characters in that same
collation.

Because all collations have been established, and are not updatable, the
above functions don't need to calculate anything dynamically during
run-time.

And thus, finally, we would have the function:

*fn:invert-string($input as xs:string,
$collation-name as xs:string) as xs:string*

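
As a rough illustration only, the idea can be modelled in Python on a toy
collation. None of the three functions exists in any XPath processor; the
Python names mirror the proposed ones, and the collation name
"toy:lower-ascii" and its 26-character table are made up for the sketch.
A real collation would also have to deal with characters that compare equal
and with multi-character collation elements, which this ignores.

```python
from typing import Dict, List

# Hypothetical collation table: characters listed in their sorted order.
TOY_COLLATIONS: Dict[str, str] = {
    "toy:lower-ascii": "abcdefghijklmnopqrstuvwxyz",
}


def collation_characters(collation_name: str) -> List[int]:
    """Model of the proposed fn:collation-characters:
    codepoints of the collation's characters, in collation order."""
    return [ord(c) for c in TOY_COLLATIONS[collation_name]]


def index_in_collation(collation_name: str, character: str) -> int:
    """Model of the proposed fn:index-in-collation (1-based, as in XPath)."""
    return collation_characters(collation_name).index(ord(character)) + 1


def invert_string(text: str, collation_name: str) -> str:
    """Model of the proposed fn:invert-string: replace each character by the
    one at the mirrored position (n + 1 - i) in the collation order."""
    chars = collation_characters(collation_name)
    n = len(chars)
    return "".join(
        # 1-based position n + 1 - i is 0-based offset n - i
        chr(chars[n - index_in_collation(collation_name, c)])
        for c in text
    )


name = "toy:lower-ascii"
print(invert_string("abc", name))
# Sorting ascending on the inverted key reverses the order of these words.
print(sorted(["cat", "ant", "bee"], key=lambda s: invert_string(s, name)))
```

In this toy model a string that is a prefix of another (for example "a" and
"ab") still sorts first after inversion, so the inverted key alone does not
give a full descending order for such pairs.
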
Thanks,
Dimitre


On Fri, Mar 15, 2024 at 7:48 AM Christian Grün <cg@basex.org> wrote:

> @Norm The following function calls should all yield the same result:
>
>   sort($input)
>   sort($input, keys := identity#1)
>   sort($input, keys := string-to-codepoints#1)
>
> Instead of subtracting 0x110000, we could as well use the negative value:
>
>   sort($input, keys := fn { string-to-codepoints(.) ! (-.), 1 })
>
> If we compare strings as a plain sequence of (codepoint) integers, it
> shouldn’t matter whether the subtracted value is legal Unicode. That’s no
> option, however, as soon as collations come into play.
>
> @Liam
>
> • I believe that Dimitre was looking for alternatives for his fn:ranks
> approach.
> • I assume that most users will prefer sort($input, orders := 'descending')
> or reverse(sort($input)).
> • We have added sort-with($input, $comparators) a short while ago.
>
>
Received on Friday, 15 March 2024 15:47:34 UTC