FW: Emphasis skip property? [W3C I18N Action #99]

FYI… for the archive

 

From: Ken Whistler <kenwhistler@sonic.net> 
Sent: Tuesday, May 28, 2024 11:30 AM
To: Addison Phillips <addisoni18n@gmail.com>; 'Robin Leroy' <eggrobin@unicode.org>
Cc: 'Mark Davis Ⓤ' <mark@unicode.org>; 'Markus Scherer' <markus.icu@gmail.com>; petercon@unicode.org; craig@unicode.org; asmus@unicode.org; public-i18n-core@w3.org; 'Florian Rivoal' <florian@rivoal.net>; fantasai@inkedblade.net; unicoRe@unicode.org
Subject: Re: Emphasis skip property? [W3C I18N Action #99]

 

Addison,

I have some further comments interspersed below.

On 5/26/2024 12:19 PM, Addison Phillips wrote:

I want to call out that, while this seems to answer the question in our specific case, 

Note that even in the wakiten case, this isn't really the end of the discussion. There was a long thread about this in a Property & Algorithms Group (PAG) issue. For wakiten, it is fairly trivial to enumerate the list of ASCII characters (and a couple of others) that require special behavior that isn't captured by a binary gc=P versus gc=S distinction based on the Unicode General_Category property. And PAG (and the UTC) don't want to get into the business of trying to catalog and enumerate all the special use cases, particularly for legacy ASCII characters, defined in a wide variety of protocols and other specifications.

The UTC also doesn't want to get into the business of trying to refine the General_Category into some more "perfect" classification that will fix more edge cases for external specifications. We locked the door on sub-classification of gc years ago precisely because further tinkering with it was so destabilizing. And adding hacks like "Punctuation_That_Is_Really_A_Symbol" or whatnot is just hacking at the issue to get around the fact that gc is a *general* classification of characters, and known not to be either exclusive or perfect for all uses.

Where there is room for UTC action in this area, I think, is for the UTC to provide further guidance about *non*-ASCII characters that are "sufficiently like" ASCII (or Latin-1) characters that specifications should probably consider them for special-casing along with the core characters they really care about. Thus, for wakiten, it is fairly easy to say [# % & @ § ¶] are special, but it gets more arcane to spell out which *other* of the many thousands of Unicode characters are sufficiently like "%", for example, to get similar treatment. I don't think we should be distributing the burden of that kind of determination out to hundreds of protocol and specification editors.
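As a rough illustration of the point above (a sketch for the archive, not part of any UTC or CSS recommendation), the following Python snippet uses the standard `unicodedata` module to print the General_Category of the six characters listed, and then applies one possible heuristic, NFKC compatibility mapping, to find characters "like" the percent sign. The helper names and the candidate list are hypothetical and chosen only for this example.

```python
import unicodedata

# Characters the message lists as special for wakiten (emphasis marks).
SPECIAL = "#%&@§¶"

def major_class(ch):
    """Return the first letter of the General_Category (e.g. 'P' or 'S')."""
    return unicodedata.category(ch)[0]

def compat_lookalikes(base, candidates):
    """Candidates whose NFKC form equals `base`.

    This is only a heuristic assumed for illustration; it is not a rule
    defined by the UTC or by any specification.
    """
    return [
        c for c in candidates
        if c != base and unicodedata.normalize("NFKC", c) == base
    ]

for ch in SPECIAL:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: gc={unicodedata.category(ch)}")

# Hypothetical candidate set: ARABIC PERCENT SIGN, PER MILLE SIGN,
# SMALL PERCENT SIGN, FULLWIDTH PERCENT SIGN.
PERCENT_CANDIDATES = "\u066A\u2030\uFE6A\uFF05"
print([f"U+{ord(c):04X}" for c in compat_lookalikes("%", PERCENT_CANDIDATES)])
```

With the Unicode data shipped in recent Python versions, all six listed characters report a punctuation category, so gc alone does not separate them from ordinary punctuation. The NFKC heuristic picks up only the compatibility variants of "%" and misses characters such as U+066A and U+2030, which is the kind of gap where further guidance would help.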

it doesn’t answer the more general question of how to approach Unicode with requests for issues such as character property management. 

In general, there are three ways to proceed:

1. Make an explicit proposal (to add a property, modify a property, etc. for the UCD) to the UTC. Such proposals get routed to PAG for detailed discussion. Proponents can engage in that discussion if they want, and may get suggestions for how to improve a proposal, or may be rebuffed if PAG determines there are real problems with the proposal, or PAG may suggest alternative approaches, and so on. Adding a new property needs to pass a fairly high bar, in part because of the ongoing, permanent maintenance burden.

2. Write up a Unicode Technical Note (UTN). These documents do not require UTC approval, and as long as they don't recommend non-conformant behavior, can pretty much address whatever the author wants. We have taken to recommending this approach for people to fill out details about particular script behavior and rendering, for example, but this route has also been used to fill out details and edge cases for particular properties. See, for example, UTN #43 about the kStrange property:

https://www.unicode.org/notes/tn43/

A UTN about wakiten (and related emphasis behavior in East Asian typography) would be perfectly appropriate, and could provide as much context and as much discussion about exceptional character cases and how they are related to existing properties as necessary. Then a CSS specification could simply refer to that UTN for details.

3. Write up a Unicode Technical Report (UTR). These are specifications whose content *is* maintained by the UTC and which require a formal approval process. The most obviously relevant instance of this for the case we are discussing is UTR #25, Unicode Support for Mathematics:

https://www.unicode.org/reports/tr25/

which among many other things defines a collection of mathematical character properties. Those properties are not maintained as part of the Unicode Character Database, per se, but are maintained as part of the data posted on the website and maintained in the context of UTR #25.

A new UTR is a heavy lift, but it could be appropriate if some property issue were of general-enough interest that an entire new specification devoted to it made sense.

 

* W3C may not want to maintain it, but that does not mean that the UTC wants to do so—nor, more relevantly, that it has the resources to do so.

 

I will note that the UTC is more generally in the business of managing lists of character properties and that TUS (and related materials, such as the UCD) is a better place for information about characters _in general_ than having various W3C Specifications and Notes try to manage it. For one thing, the Web is not the only application that might care about such things, our standards are not the obvious place to look, and we are probably not the right standards body to make decisions like this. 

True enough (and see above). But there is a line to be negotiated here.

The UTC should probably not define a character property Is_Localpart_Domain_Delimiter and assign that to one character, U+0040 "@". That is, instead, the business of RFC 5321.

And nobody would expect W3C to get into the business of defining the content of Indic_Syllabic_Category=Virama, for example.

But there is a gray area in between, where various groups have to figure out which lists belong where.

 

If there is a concern about human resources available for tasks that we’re requesting, I will note that all of the directly copied W3C folks have done work in the Unicode space in the past. We are not proposing an “unfunded mandate”. We just want to ensure that the work is done in the right way in the right place, where the results will be accessible and properly maintained. 

Yep, I agree.

--Ken

 

Received on Thursday, 30 May 2024 15:01:55 UTC