- From: Dimitre Novatchev <dnovatchev@gmail.com>
- Date: Fri, 15 Mar 2024 08:47:17 -0700
- To: Christian Grün <cg@basex.org>
- Cc: Norm Tovey-Walsh <norm@saxonica.com>, "public-xslt-40@w3.org" <public-xslt-40@w3.org>
- Message-ID: <CAK4KnZcHyH9AC=0M5O3A+CAKaWCNLkx=xbh2s_QXe_3rF=XcRA@mail.gmail.com>
> If we compare strings as a plain sequence of (codepoint) integers, it shouldn’t matter
> whether the subtracted value is legal Unicode.
> That’s no option, however, as soon as collations come into play.

Yes, and this is what I mentioned as a Note in an earlier message.

It is quite straightforward, given the sorted characters of a collation, to map a single character to its symmetrical counterpart (its opposite) with respect to the median of the collation.

I have a question for Michael Kay, Christian Gruen and other implementors:

Where can one find a list of all collation names and get the full sorted list of characters for each collation? I haven't seen such a list of collation names in any documentation for either Saxon or BaseX.

I have a full list of the collation names used in SQL Server:

https://learn.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-ver16

but these are not the names used in Saxon.

Also, the collation names used by SQL Server contain standard substrings that define: Case Sensitivity (_CS), Accent Sensitivity (_AS), Kana Sensitivity (_KS), Width Sensitivity (_WS), Variation Selector Sensitivity (_VSS), Binary (_BIN), Binary-code point (_BIN2) and UTF-8 (_UTF8).

I have only seen that Saxon uses this collation name for Swedish: "http://www.w3.org/2013/collation/UCA?lang=se", but this gives us absolutely no hint about the case sensitivity of the collation, or any other sensitivity.
Compare with the ones from SQL Server:

- *SQL_SwedishPhone_Pref_CP1_CI_AS* - Finnish-Swedish, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive for Unicode Data, SQL Server Sort Order 184 on Code Page 1252 for non-Unicode Data
- *SQL_SwedishStd_Pref_CP1_CI_AS* - Finnish-Swedish, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive for Unicode Data, SQL Server Sort Order 185 on Code Page 1252 for non-Unicode Data

However, the only collation name for Swedish that I could find being used in an example of the F&O document was: "http://www.w3.org/2013/collation/UCA?lang=se"

For SQL Server one can immediately get all data available about all collations using just a simple query:

    SELECT name, description
    FROM sys.fn_helpcollations()
    ORDER BY name;

This returns 3955 rows, one row per collation.

Do we have a tool like this in our XPath processors?

If we know the collation name and have standard functions such as:

- *fn:collation-characters($collation-name as xs:string) as array(xs:integer)* - returns the code-points of all characters contained in the collation having the name $collation-name, in the sorted order (defined by the collation) of the characters.
- *fn:index-in-collation($collation-name as xs:string, $character as xs:string) as xs:integer* - returns the index of the string-to-codepoints($character) in the array produced by fn:collation-characters($collation-name)

Then it is a one-liner to invert any string of characters in that same collation.

Because all collations have been established, and are not updatable, the above functions don't need to calculate anything dynamically during run-time.
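To make the idea concrete, here is a minimal sketch in Python rather than XPath. The names collation_characters, index_in_collation and invert_string mirror the functions proposed above but are hypothetical, and the five-character "toy:abc" collation is invented for illustration. A real collation orders far more characters and is not purely per-character, which this sketch ignores.

```python
# Hypothetical toy "collation": the characters it contains, in sorted order.
# Invented for illustration only; not taken from any real processor.
TOY_COLLATIONS = {
    "toy:abc": ["a", "b", "c", "d", "e"],
}


def collation_characters(collation_name: str) -> list[int]:
    """Code points of all characters in the collation, in collation order."""
    return [ord(ch) for ch in TOY_COLLATIONS[collation_name]]


def index_in_collation(collation_name: str, character: str) -> int:
    """1-based position of the character in the collation's sorted list."""
    return collation_characters(collation_name).index(ord(character)) + 1


def invert_string(text: str, collation_name: str) -> str:
    """Replace each character by its mirror image around the collation's median."""
    chars = collation_characters(collation_name)
    n = len(chars)
    return "".join(
        chr(chars[n - index_in_collation(collation_name, ch)])
        for ch in text
    )


# Sorting by the inverted string reverses the order of equal-length strings.
words = ["ace", "bad", "cab"]
inverted_order = sorted(words, key=lambda w: invert_string(w, "toy:abc"))
```

With this toy data the character "a" maps to "e", "b" to "d", and "c" to itself. The usage example sticks to strings of equal length, because a string that is a prefix of another still sorts first after inversion.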
And thus, finally, we would have the function:

*fn:invert-string($input as xs:string, $collation-name as xs:string) as xs:string*

Thanks,
Dimitre

On Fri, Mar 15, 2024 at 7:48 AM Christian Grün <cg@basex.org> wrote:

> @Norm The following function calls should all yield the same result:
>
> sort($input)
> sort($input, keys := identity#1)
> sort($input, keys := string-to-codepoints#1)
>
> Instead of subtracting 0x110000, we could as well use the negative value:
>
> sort($input, keys := fn { string-to-codepoints(.) ! (-.), 1 })
>
> If we compare strings as a plain sequence of (codepoint) integers, it
> shouldn’t matter whether the subtracted value is legal Unicode. That’s no
> option, however, as soon as collations come into play.
>
> @Liam
>
> • I believe that Dimitre was looking for alternatives for his fn:ranks
> approach.
> • I assume that most users will prefer sort($input, orders = 'descending')
> or reverse(sort($input)).
> • We have added sort-with($input, $comparators) a short while ago.
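The negated-codepoint key in the quoted message can be tried out in Python, which also compares plain strings by code point. This is only a sketch of how I read that expression: descending_key is a hypothetical name, and the trailing 1 plays the role of the ", 1" terminator, so that a string sorts after any longer string that it is a prefix of.

```python
def descending_key(s: str) -> tuple[int, ...]:
    """Negated code points followed by a positive terminator.

    Every negated code point is <= 0, so the terminator 1 is larger than
    any of them and pushes a prefix behind the strings that extend it.
    """
    return tuple(-ord(ch) for ch in s) + (1,)


words = ["a", "ab", "b", "abc"]

# Ascending sort on the negated key ...
by_negated_key = sorted(words, key=descending_key)

# ... compared with a plain descending codepoint sort.
by_reverse = sorted(words, reverse=True)
```

Without the terminator, "a" would sort before "ab" under the negated key, which is not a descending order.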
Received on Friday, 15 March 2024 15:47:34 UTC