Re: [csswg-drafts] Consider Canonicalization of language tags in :lang() selector maching (#4154)

Thanks for the opportunity to discuss at TPAC. I'm going to add some personal notes here that hopefully will help further discussion.

The section in BCP47 on canonicalization can be found [here](https://tools.ietf.org/html/bcp47#page-1-66). As noted, this process includes several operations for dealing with different types of mapping for grandfathered/redundant tags as well as changes to region or language subtags over time. All of these mappings improve matching and should be part of CSS Selectors.
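To make those mappings concrete, here is a minimal sketch in Python. The function name `canonicalize` and the three tables are hypothetical and hold only a handful of Preferred-Value entries from the IANA Language Subtag Registry; a real implementation would load the whole registry and also handle variants, extensions, and extlangs properly.

```python
# Illustrative excerpt only: a few Preferred-Value mappings from the IANA
# Language Subtag Registry. These tables are assumptions for the sketch,
# not a complete data set.
WHOLE_TAG = {  # grandfathered / redundant tags, replaced as a whole
    "i-klingon": "tlh",
    "art-lojban": "jbo",
    "zh-yue": "yue",
}
LANGUAGE = {"iw": "he", "in": "id"}  # deprecated primary language subtags
REGION = {"BU": "MM", "DD": "DE"}    # deprecated region subtags


def canonicalize(tag):
    """Hypothetical helper: apply whole-tag, language, and region mappings."""
    lowered = tag.lower()
    if lowered in WHOLE_TAG:
        return WHOLE_TAG[lowered]
    subtags = lowered.split("-")
    out = [LANGUAGE.get(subtags[0], subtags[0])]
    in_extension = False
    for sub in subtags[1:]:
        if len(sub) == 1:
            # A singleton starts an extension or private-use sequence;
            # everything from here on is copied through unchanged.
            in_extension = True
        if not in_extension and len(sub) == 2 and sub.isalpha():
            # Two letters after the primary subtag: a region subtag.
            sub = sub.upper()
            sub = REGION.get(sub, sub)
        elif not in_extension and len(sub) == 4 and sub.isalpha():
            # Four letters: a script subtag, conventionally title-cased.
            sub = sub.title()
        out.append(sub)
    return "-".join(out)
```

With these tables, `canonicalize("iw-IL")` gives `he-IL`, `canonicalize("my-BU")` gives `my-MM`, and `canonicalize("i-klingon")` gives `tlh`. Applying the same function to both the `:lang()` range and the content's `lang` value is what lets old and new spellings of the same language match each other.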

The canonical form of language tags is without extlangs: this was the intention when extlangs were created. The extlang form exists because there are cases where content authors may find some utility in using them, but BCP47 implementations are encouraged to "lose" the enclosing primary language subtag. CSS may still choose to do differently.

The matching challenge has multiple considerations. I'll try to illustrate using the `yue` (Cantonese) and `zh` (Chinese) subtags. Suppose you have a stylesheet and content like so:

```
:lang(zh) { /* something */ }
:lang(yue) { /* something else */ }

<p lang="zh-Hans">...
<p lang="yue">...
```

With the basic canonicalization of language tags, the range `zh` only matches the first `<p>` and the range `yue` only the second. With the extlang canonicalization, we see the above transformed to the following for matching:

```
:lang(zh) { }
:lang(zh-yue) { }

<p lang="zh-Hans">...
<p lang="zh-yue">...
```

Now the first selector matches *both* content items (the second still only matches one item). This is probably *not* the intention of the author. Obviously, as was pointed out in the meeting, the opposite case exists (if one started with `zh-yue` and wrote styles for `:lang(zh)`).
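The two outcomes can be compared side by side with a small Python sketch. Everything here is hypothetical: the helper names, the three-entry `EXTLANG_PREFIX` table (an excerpt of extlang Prefix values from the registry), and especially `matches`, which is a simplified whole-subtag prefix match rather than the full extended filtering with wildcards that `:lang()` calls for.

```python
# Illustrative excerpt: extlang subtag -> its Prefix (the macrolanguage).
EXTLANG_PREFIX = {"yue": "zh", "cmn": "zh", "arb": "ar"}


def to_canonical_form(tag):
    """Drop a leading macrolanguage when it is followed by one of its extlangs."""
    subtags = tag.lower().split("-")
    if len(subtags) > 1 and EXTLANG_PREFIX.get(subtags[1]) == subtags[0]:
        subtags = subtags[1:]
    return "-".join(subtags)


def to_extlang_form(tag):
    """Prepend the macrolanguage when the primary subtag is a known extlang."""
    subtags = tag.lower().split("-")
    prefix = EXTLANG_PREFIX.get(subtags[0])
    if prefix is not None:
        subtags = [prefix] + subtags
    return "-".join(subtags)


def matches(language_range, tag, normalize):
    """Simplified match: the normalized range must be a subtag-wise prefix
    of the normalized tag. No wildcard handling."""
    range_subtags = normalize(language_range).split("-")
    tag_subtags = normalize(tag).split("-")
    return tag_subtags[:len(range_subtags)] == range_subtags


for name, normalize in [("canonical", to_canonical_form),
                        ("extlang", to_extlang_form)]:
    for language_range in ("zh", "yue"):
        for tag in ("zh-Hans", "yue"):
            print(name, language_range, tag,
                  matches(language_range, tag, normalize))
```

Under `to_canonical_form`, the range `zh` matches only `zh-Hans` and `yue` matches only `yue`. Under `to_extlang_form`, `zh` matches both tags. The opposite case also shows up: with the canonical form, content tagged `zh-yue` is normalized to `yue` and no longer matches the range `zh`.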

Note that the *tag* `zh-yue` was registered during the RFC3066 era and is one of the so-called "redundant" grandfathered tags. Its replacement is in the BCP47 registry as `yue` (and not `zh-yue`). However, tags such as `zh-yue-Hant` or `zh-yue-CN` and the like have been well-formed and valid since RFC4646.

What is at issue here is which sort of content-to-stylesheet canonicalization incompatibility to incur. Note that there are quite a few macrolanguages in the registry, so this affects more than the Chinese complex of languages, and the degree to which the extlang form fits or interferes with actual use varies by macrolanguage. Arabic or Malay, to pick two examples, probably do not want the extlang form in as many cases as Chinese does, because the enclosed languages really have different needs.

I will supply additional links and some further guidance anon: I've solicited more input from the CLDR community.

-- 
GitHub Notification of comment by aphillips
Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/4154#issuecomment-532518994 using your GitHub account

Received on Wednesday, 18 September 2019 05:00:03 UTC