Re: String complements from Dimitre Novatchev on 2024-03-15 (public-xslt-40@w3.org from March 2024)

From: Dimitre Novatchev <dnovatchev@gmail.com>
Date: Fri, 15 Mar 2024 13:26:08 -0700
To: Christian Grün <cg@basex.org>
Cc: Norm Tovey-Walsh <norm@saxonica.com>, "public-xslt-40@w3.org" <public-xslt-40@w3.org>
Message-ID: <CAK4KnZd=GXmsqzzz2aXdRWQkZ_knfTfGwJDfV2ciSZiBCV77tQ@mail.gmail.com>

On Fri, Mar 15, 2024 at 11:41 AM Christian Grün <cg@basex.org> wrote:

> Before requesting new functions, we should clarify if there’s a problem to
> solve – or, in other words, if at least 2, 3 people believe there's a
> problem.
>
>
>
Everyone who needs to specify a collation argument when calling one of the
many standard functions that accept a collation name needs to know how to
choose the right collation, that is, whether all of the characters they are
interested in are covered (correctly) by the collations they are choosing
from.

Or should this be done by word of mouth?

SQL Server makes this as easy as:

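```
-- One row per single-byte code point; the column collation decides the order
declare @chars table
(
  CodePoint binary(1) primary key,
  Character char(1) collate SQL_SwedishPhone_Pref_CP1_CI_AS
)

-- int counter, so that 0xFF is included and the loop still terminates
declare @codePoint int = 0
while (@codePoint <= 255)
begin

 insert into @chars(CodePoint, Character)
 values (cast(@codePoint as binary(1)),
         cast(cast(@codePoint as binary(1)) as char(1)));

 set @codePoint += 1;
end

-- List the characters in the order defined by the collation
select *
from @chars
order by Character
```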


But we don't have anything like this.
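
To illustrate the kind of listing I mean, here is a minimal Python sketch (not XPath). The names `ci_key` and `collation_characters` are hypothetical, and the key function is only a toy stand-in for a real case-insensitive collation. It shows that code point order and collation order differ:

```python
def ci_key(ch):
    # Toy stand-in for a case-insensitive collation (an assumption, not a
    # real collation): compare by lowercase form, break ties by code point.
    return (ch.lower(), ord(ch))


def collation_characters(chars):
    """Return the given characters sorted by the stand-in collation."""
    return "".join(sorted(chars, key=ci_key))


sample = "xyzXYZ"
by_codepoint = "".join(sorted(sample))       # plain code point order
by_collation = collation_characters(sample)  # order under the toy collation

print(by_codepoint)
print(by_collation)
```

Under the toy collation 'X' and 'x' end up next to each other, which is not the case in code point order.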

Should the user be choosing blindly?

Thanks,
Dimitre


> Am 15.03.2024 19:25 schrieb Dimitre Novatchev <dnovatchev@gmail.com>:
> > > Seems to work like a charm 😀
> >
> > Doesn’t blindly subtracting the code point for 0x110000 run the risk of
> > producing a non-Unicode character? I think the original code point would
> > have to be in…checks notes…plane 16, so fairly unlikely, but still…
>
> I think there is a more important question: will this work correctly with
> collations that are not binary (that is, where cp1 > cp2 doesn't guarantee
> that Char(cp1) > Char(cp2) in this collation)?
>
> The answer is no, and we have too many collations (like any CI
> (case-insensitive) collation) in which 'X' and 'x' are consecutive in the
> sorted character set of this collation.
>
> When sorting using a collation, we must use not the code point of a
> character, but its index in the sorted characters of this collation.
>
> This is why it is important to have a function
>
> fn:collation-characters($collation-name as xs:string) as xs:string
>
> that returns the individual characters of the collation, sorted according
> to this collation.
>
> Not to mention that the user, before specifying a collation name to one of
> the variety of functions that take collations as parameters, needs to be
> well-informed of exactly which characters are in this collation.
>
> At present we don't have such a function, and it would really be very
> useful to have one.
>
>
> Thanks,
> Dimitre
>
> On Fri, Mar 15, 2024 at 5:42 AM Norm Tovey-Walsh <norm@saxonica.com>
> wrote:
>
>> Dimitre Novatchev <dnovatchev@gmail.com> writes:
>> > Seems to work like a charm 😀
>>
>> Doesn’t blindly subtracting the code point for 0x110000 run the risk of
>> producing a non-Unicode character? I think the original code point would
>> have to be in…checks notes…plane 16, so fairly unlikely, but still…
>>
>>                                         Be seeing you,
>>                                           norm
>>
>> --
>> Norm Tovey-Walsh
>> Saxonica
>>
>
>
>
>

Received on Friday, 15 March 2024 20:26:26 UTC