RE: Emphasis mark skipping (i18n-action#49 csswg-drafts#839)

Hi Ken and Robin,

 

Thanks for your notes, which are wholly unsurprising (and it is understood that your responses are also unofficial). As you can see from the discussion thread, there isn’t a strong expectation that the UTC would do a reclassification here. 

 

> Do you need a formal note from the UTC on the question of the General_Category of those characters?

 

Only if there is something materially different in that response; otherwise, the CSS folks will get the gist of it from this thread.

 

> I do not know whether this approach is practical for CSS.

 

CSS doesn’t want to be in the business of making lists of characters and their properties. I think you could read this as a request that *Unicode* make such a list/derived property. I note that there is also this CLDR issue we recently filed (which doesn’t seem like a CLDR problem to me): https://unicode-org.atlassian.net/browse/CLDR-17044, about a mapping from small kana to kana that CSS maintains, which looks pretty similar to this.
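For concreteness, here is a rough, purely illustrative Python sketch (using only unicodedata; the function name and the name-based heuristic are mine, not the mapping CSS or CLDR actually ships) of how such a small-kana-to-kana mapping could in principle be derived mechanically:

# Illustrative only: approximate a "small kana -> full-size kana" mapping
# from Unicode character names. The list CSS actually maintains is curated
# and may differ in details.
import unicodedata

def full_size_kana(ch: str) -> str:
    """Return the full-size counterpart of a small kana, else ch unchanged."""
    try:
        name = unicodedata.name(ch)
    except ValueError:
        return ch  # unassigned or unnamed code point
    if ("HIRAGANA" in name or "KATAKANA" in name) and " SMALL " in name:
        try:
            return unicodedata.lookup(name.replace(" SMALL ", " "))
        except KeyError:
            return ch  # no full-size counterpart under the expected name
    return ch

for ch in "ぁっゃッㇰ":
    print(ch, "->", full_size_kana(ch))

(Whether the real mapping agrees with a heuristic like this in every case is exactly the sort of thing a published data file would settle.)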

 

Addison

 

Addison Phillips

Chair (W3C Internationalization WG)

 

Internationalization is not a feature.

It is an architecture.


From: Robin Leroy <eggrobin@unicode.org> 
Sent: Thursday, October 5, 2023 4:34 PM
To: Addison Phillips <addisoni18n@gmail.com>
Cc: unicore@unicode.org; Mark Davis Ⓤ <mark@unicode.org>; public-i18n-core@w3.org; Ken Whistler <kenwhistler@sonic.net>
Subject: Re: Emphasis mark skipping (i18n-action#49 csswg-drafts#839)

 

Dear Addison,

 

(Trying to reply on this thread to play nice with your trackers, but note that Ken’s email is on the other one)

 

Ken wrote:

> In such cases, one can always start with General_Category values, but then go on to find the most precise (and elegant) statement of the exception list that applies in a particular use case, and look for ways to future-proof that statement against possible further expansions of the supported repertoire of characters.
>
> Examples of that exercise may be found in the derivations of the Word_Break <https://www.unicode.org/reports/tr29/#Table_Word_Break_Property_Values> property, which often combine General_Category with Line_Break and other properties to try to get the right sets, with a handful of specific exceptions to get to the extreme corner cases.

 

This is a fine art, and tricky interactions abound; the Word_Break=Numeric property changed from


Line_Break = Numeric
or any of the following:
U+FF10 (0) FULLWIDTH DIGIT ZERO
..U+FF19 (9) FULLWIDTH DIGIT NINE
and not U+066C ( ٬ ) ARABIC THOUSANDS SEPARATOR

 

to


Line_Break = Numeric
or General_Category = Decimal_Number
and not U+066C ( ٬ ) ARABIC THOUSANDS SEPARATOR

 

in Unicode 15.1 in order to avoid changing the actual set of characters with Word_Break=Numeric, as some characters were moved out of Line_Break=Numeric as part of a change to line breaking in some Brahmic scripts (see UTC action item 175-A78 <https://www.unicode.org/L2/L2023/23076.htm#175-A78>  and the document referenced therein).
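To make that concrete, here is a minimal sketch (assuming PyICU is available; whether the check passes depends on the Unicode version your ICU build carries) that rebuilds the 15.1 derivation with UnicodeSet arithmetic and compares it against the published property:

# Rebuild the Unicode 15.1 derivation of Word_Break=Numeric and compare it
# with the published property value, which is what implementers actually use.
from icu import UnicodeSet

# Line_Break = Numeric, or General_Category = Decimal_Number,
# and not U+066C ARABIC THOUSANDS SEPARATOR.
derived = UnicodeSet(r"[[\p{lb=NU}\p{gc=Nd}]-[\u066C]]")

# The same set, as carried directly in ICU's (UCD-derived) property data.
published = UnicodeSet(r"[\p{WB=NU}]")

print(derived.containsAll(published) and published.containsAll(derived))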

 

The way the Unicode Standard avoids unnecessary churn for implementers is by publishing data files for these kinds of derived properties, so that implementers just pick up the new data files, and changes to the derivation are only a problem for the maintainers of the Unicode Character Database. This is why, for instance, we added a new Indic_Conjunct_Break property at the last minute in Unicode 15.1, to avoid baking too much property-set arithmetic into the grapheme cluster breaking rules (see UTC consensus 176-C26 <https://www.unicode.org/L2/L2023/23157.htm#176-C26> and the documents referenced nearby).
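As an illustration of “just pick up the new data files” (a sketch only; the local path is an assumption on my part), parsing one of the published UCD files takes only a few lines:

# Parse a UCD property file such as auxiliary/WordBreakProperty.txt into a
# {code point: property value} map. Data lines look like
#   0030..0039    ; Numeric # Nd  [10] DIGIT ZERO..DIGIT NINE
def parse_ucd_file(path):
    values = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()  # drop comments and blank lines
            if not line:
                continue
            codes, value = (field.strip() for field in line.split(";", 1))
            if ".." in codes:
                start, end = (int(c, 16) for c in codes.split(".."))
            else:
                start = end = int(codes, 16)
            for cp in range(start, end + 1):
                values[cp] = value
    return values

wb = parse_ucd_file("WordBreakProperty.txt")  # local copy; path is an assumption
print(wb.get(0x0030))  # DIGIT ZERO -> 'Numeric'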

 

I do not know whether this approach is practical for CSS.

 

Note that the sets of numbers <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BN%7D&g=&i=> , numeric characters <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5CP%7Bnt%3DNone%7D&g=&i=> , characters that behave like numbers for line breaking <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BLB%3DNU%7D&g=&i=> , and characters that behave like numbers for word breaking <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BWB%3DNU%7D&g=&i=>  are all different; it is not altogether surprising that the set of characters that do not show an explicit mark in a span of Japanese wakiten should be different from the general-purpose set of punctuation characters.
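(For anyone who wants to see just how different, the property patterns in those links drop straight into a UnicodeSet; PyICU assumed again:)

# Count the members of the four "number-like" sets linked above.
from icu import UnicodeSet

sets = {
    "gc=N (numbers)": UnicodeSet(r"[\p{N}]"),
    "nt!=None (numeric characters)": UnicodeSet(r"[\P{nt=None}]"),
    "lb=NU (line breaking)": UnicodeSet(r"[\p{lb=NU}]"),
    "WB=NU (word breaking)": UnicodeSet(r"[\p{WB=NU}]"),
}
for label, s in sets.items():
    print(label, s.size(), "code points")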

 

> to-mey-to/to-mah-to

I am surprised that Ken did not write /təˈmeɪtoʊ/təˈmɑːtəʊ/ :-)

 

> P.S. These are all my personal opinions -- not some statement endorsed by the UTC.

Likewise, I do not speak for a committee that has not met since this was brought up.

Do you need a formal note from the UTC on the question of the General_Category of those characters?

 

Best regards,

 

Robin Leroy

Received on Friday, 6 October 2023 14:52:28 UTC