Re: [csswg-drafts] Consider Canonicalization of language tags in :lang() selector matching (#4154)

The CSS Working Group just discussed `Consider Canonicalization of language tags in :lang() selector matching`, and agreed to the following:

* `RESOLVED: Accept already-merged edits`

<details><summary>The full IRC log of that discussion</summary>
&lt;TabAtkins> florian: we have a :lang() selector that takes a language tag, letting you style based on lang<br>
&lt;TabAtkins> florian: one complication is there are multiple ways to express the same language<br>
&lt;TabAtkins> florian: writing a tiny page on your own, you just pick one and use that tag<br>
&lt;TabAtkins> florian: but if you have a large team, or writing generic library CSS, you don't necessarily know which lang tag the page author will use<br>
&lt;TabAtkins> florian: there exist canonicalization mechanisms we could use, so two different forms for the same lang will match<br>
&lt;TabAtkins> florian: but there are multiple ways to do it and they give different results<br>
&lt;TabAtkins> florian: not sure where i18n progress is<br>
&lt;TabAtkins> florian: before today, Addison and I had opposite views of ideal<br>
&lt;TabAtkins> florian: start with chinese<br>
&lt;TabAtkins> florian: Cantonese can be expressed as yue<br>
&lt;TabAtkins> florian: or as zh-yue<br>
&lt;TabAtkins> florian: the standard canonicalization takes za-yue into yue<br>
&lt;TabAtkins> florian: another way does the opposite<br>
&lt;TabAtkins> s/za-yue/zh-yue/<br>
&lt;TabAtkins> ChrisL: is this bidirectional, or are there many-to-one relationships?<br>
&lt;TabAtkins> florian: answer that in a sec<br>
&lt;TabAtkins> florian: if your page has zh-yue, and you have :lang(yue), it doesn't matter which direction we canonicalized for matching<br>
&lt;TabAtkins> florian: but if you also have a :lang(zh) selector, if we canon zh-yue into yue, the selector doesn't match<br>
&lt;TabAtkins> florian: is that good? or bad?<br>
&lt;TabAtkins> florian: I'd argue it's likely bad<br>
&lt;TabAtkins> florian: from CSS pov we're mostly dealing with typography here. if you have a selector for Chinese langs, it's desirable that it applies to Cantonese even if it's not tagged with zh. but if zh- *is* prefixed there, you definitely want it<br>
&lt;TabAtkins> florian: zh-yue is more specific than zh, so maybe more accurate, but it's still a Chinese lang and we shouldn't lose that<br>
&lt;castastrophe> q+<br>
&lt;TabAtkins> florian: so my opinion is we canonicalize yue to zh-yue. so if you're being specific it stays specific, and if you're generic with just zh, it'll match that too<br>
&lt;TabAtkins> florian: so I think we should canonicalize to the extended form and match on that<br>
&lt;TabAtkins> florian: but I think Addison was arguing the opposite, just canonicalize and don't convert to extended form<br>
&lt;TabAtkins> addison: there is an ex-lang form, it's somewhat deprecated<br>
&lt;TabAtkins> addison: most places, in bcp47 it says canonicalization should remove the primary, so zh-yue becomes yue<br>
&lt;TabAtkins> addison: afaik most mechanisms that work with bcp47 follow that canonicalization<br>
&lt;TabAtkins> addison: the challenges you point out are real<br>
&lt;TabAtkins> addison: I think Chinese is a particularly complex space<br>
&lt;TabAtkins> addison: if you look at the other ex-langs, you won't want the ex-lang form<br>
&lt;emilio> FWIW we changed behavior here not that long ago: https://bugzilla.mozilla.org/show_bug.cgi?id=2003721<br>
&lt;TabAtkins> addison: but Chinese is complex because it has non-mandarin languages with subtags, and script variation, and a few regional codes<br>
&lt;florian> q+<br>
&lt;TabAtkins> addison: so you can get a whole forest of different kinds of tags that are similar but different, and canonicalizing can be hard in matching situations<br>
&lt;emilio> q+<br>
&lt;TabAtkins> addison: you want something reliable so sheet and page authors know what should work<br>
&lt;TabAtkins> addison: generally, the mechanisms that exist out there chop off primaries from the ex-lang form and simplify it<br>
&lt;TabAtkins> addison: because you don't really mean the primary language in that case, you mean the ex-lang<br>
&lt;TabAtkins> addison: there are also concepts from JS from cldr, which adds likely subtags and then matches<br>
&lt;TabAtkins> addison: that requires additional data and is more complex. might be worth discussing<br>
&lt;TabAtkins> addison: vs just bcp47 which is just string matching<br>
&lt;TabAtkins> addison: so the q is to stick to just that or use something more complex<br>
&lt;r12a> q+<br>
&lt;astearns> ack castastrophe<br>
&lt;TabAtkins> castastrophe: my use-case is I do a lot of styling with CMSes, so my connection to HTML is very limited, but I'm writing styles for generic pages. don't know specifically the markup<br>
&lt;TabAtkins> castastrophe: so 9 times out of 10 when styling for i18n, I don't know the exact lang tag the CMS is gonna use, depends on the db engineers<br>
&lt;TabAtkins> castastrophe: so my thought process is having a wide group - so if HTML has a specific lang but i'm styling a group of langs with similar norms, I can match on the generic<br>
&lt;TabAtkins> castastrophe: so I'm not familiar with all lang groups, but if there are similar styles you'd apply to the language group we should make sure you can match on the generic form<br>
&lt;astearns> ack fantasai<br>
&lt;TabAtkins> fantasai: I think we have a clear understanding of the CSS use-cases<br>
&lt;TabAtkins> fantasai: what is the reasoning that drove the canonicalization decisions for bcp47?<br>
&lt;TabAtkins> fantasai: were they thinking like, "oh, same info in less space"? or did they have particular impacts of matching behavior that had better use-cases?<br>
&lt;TabAtkins> addison: the thinking with macro languages is, generally there's a dominant form<br>
&lt;TabAtkins> addison: not always true but usually<br>
&lt;TabAtkins> addison: in zh, cmn is the dominant form, especially for writing<br>
&lt;TabAtkins> (cmn is Mandarin Chinese)<br>
&lt;TabAtkins> addison: so when you see cmn you want to get rid of it and just use zh<br>
&lt;TabAtkins> addison: for all the others, if you really want Cantonese, you really mean yue, not cmn (or the generic zh)<br>
&lt;TabAtkins> addison: this is also true of the other macro languages in general, you want to get rid of the generic tag when there's a specific lang<br>
&lt;TabAtkins> fantasai: what's the benefit of removing the generic tag for the minority langs, rather than just saving space?<br>
&lt;TabAtkins> addison: the reason is you wanted to get out of the forest of subtags<br>
&lt;TabAtkins> fantasai: that sounds like a theoretical use-case. what's the concrete use-case that's solved by removing it?<br>
&lt;TabAtkins> addison: one problem we get into is locale fallback chains. when you have subtags at some point you take it away and you're looking at a different lang (cmn)<br>
&lt;TabAtkins> florian: the thing is we're not canonicalizing in general, we're doing specific purposes<br>
&lt;TabAtkins> florian: we won't say "your tag is too specific, we'll chop bits off"<br>
&lt;TabAtkins> addison: that's kinda what you're doing - you do prefix match and I think extended filtering?<br>
&lt;TabAtkins> addison: so you do subtags matches to the tag, it's pure string matching<br>
&lt;TabAtkins> florian: if the *selector* is :lang(zh-yue) or :lang(yue), and the doc is generic zh, it shouldn't match. that's current spec and should stay<br>
&lt;TabAtkins> florian: but the other way around, :lang(zh) and your document is yue, it should match<br>
&lt;TabAtkins> florian: Cantonese is more specific than zh, yes, but if your selector says :lang(zh) you probably aren't referring to that specifically.<br>
&lt;ChrisL> q+ to wonder about en, en-gb, en-us, en-za etc<br>
&lt;TabAtkins> florian: you're almost certainly trying to set styles for all Chinese, like a font-size adjustment to align it with neighboring English or something<br>
&lt;TabAtkins> florian: so despite what you say, I think in our case it does make sense to use the ex-lang form<br>
&lt;TabAtkins> florian: the recommendations to go to the non-exlang form, I recognize them but I don't think they apply to our case<br>
&lt;astearns> ack florian<br>
&lt;dbaron> (I think Florian said "all CJK" rather than "all Chinese" though I'm not sure.)<br>
&lt;TabAtkins> florian: another thing, the behavior diff is author visible<br>
&lt;TabAtkins> florian: the string itself won't necessarily be, but we don't save that canonicalized form anywhere<br>
&lt;TabAtkins> florian: I think you said something about author being confused by seeing a longer form from canonicalization, the author won't see that. only the matching behavior is affected, internal logic<br>
&lt;TabAtkins> florian: you also mentioned CLDR for extra matching behavior. i'm not familiar enough to say anything about those, so we might want to come back to those later.<br>
&lt;astearns> ack emilio<br>
&lt;addison> q+<br>
&lt;TabAtkins> emilio: wanted to point out that depending which way we go, HTML spec has rules for :lang(zh)<br>
&lt;TabAtkins> emilio: which ones should apply to yue?<br>
&lt;TabAtkins> fantasai: all of them<br>
&lt;TabAtkins> emilio: ok if the answer is all of them it seems clear they should be expanded<br>
&lt;TabAtkins> florian: which rules?<br>
&lt;TabAtkins> emilio: some ruby alignment, some text decoration inset...<br>
&lt;TabAtkins> florian: yes, all of those should apply to Mandarin too<br>
&lt;castastrophe> q+<br>
&lt;TabAtkins> emilio: okay then yes, I think expanding makes sense. otherwise we need to fix html<br>
&lt;astearns> ack r12a<br>
&lt;TabAtkins> r12a: I think I understood the issue from florian's description, and i'm inclined to agree with him<br>
&lt;astearns> ack ChrisL<br>
&lt;Zakim> ChrisL, you wanted to wonder about en, en-gb, en-us, en-za etc<br>
&lt;TabAtkins> r12a: for the i18n folks' benefit, we didn't discuss this specific angle at the previous meeting. i'll leave it at that<br>
&lt;TabAtkins> ChrisL: moving away from Chinese, you can have "English", "British English", "American English", and you take away the "English" to focus on just the British/American. But you forgot about South African English, so now it doesn't get styled like other Englishes.<br>
&lt;TabAtkins> ChrisL: so I agree with Elika, chopping the generic tag doesn't gain us anything there<br>
&lt;astearns> ack fantasai<br>
&lt;Zakim> fantasai, you wanted to comment on assuming zh-cmn<br>
&lt;TabAtkins> fantasai: so Addison was talking about doing locale fallbacks, you chop off tags. if you chop off enough you land on zh, which "means" zh-cmn.<br>
&lt;jsahleen> q+<br>
&lt;astearns> ack addison<br>
&lt;TabAtkins> fantasai: but you shouldn't be meaning that. that only applies if you see a zh with no other context. If you're doing locale fallback, when you chop back to zh, you should treat it as generic Chinese, not specific Mandarin. That equivalence makes sense elsewhere<br>
&lt;TabAtkins> addison: I hear those arguments, sounds valid. I'm a little leery about non-chinese langs with a macro language field in the subtags registry<br>
&lt;TabAtkins> addison: your mechanism would internally introduce some subtags that are maybe more questionable<br>
&lt;florian> qq+<br>
&lt;TabAtkins> addison: like Malaysian has a number of macro languages...<br>
&lt;TabAtkins> addison: I worry this introduces some matching you might not want<br>
&lt;TabAtkins> addison: but in general I think your argument makes sense to me<br>
&lt;TabAtkins> addison: but my main call-out is that it's different from what we see in lang-tag handling almost everywhere else<br>
&lt;TabAtkins> addison: where the first thing we often do is step on the sublang to make it primary<br>
&lt;TabAtkins> addison: teaching that CSS is different... I'm struggling to think of a use-case where that's problematic<br>
&lt;r12a> in case it's useful: https://r12a.github.io/app-subtags/<br>
&lt;TabAtkins> addison: I think if you say :lang(zh) and it matches tags that explicitly aren't zh (they're "yue" or something), the matching might be unexpected<br>
&lt;astearns> ack florian<br>
&lt;Zakim> florian, you wanted to react to addison<br>
&lt;TabAtkins> addison: so we should clarify that behavior exists<br>
&lt;TabAtkins> florian: clarification - on the Chinese case I do think we want it, but you mention other groups<br>
&lt;TabAtkins> florian: what makes Chinese convenient for us is they have a shared typographic convention. written Chinese are all very similar.<br>
&lt;TabAtkins> florian: are there macro tags that group langs that are radically different?<br>
&lt;TabAtkins> addison: Serbo-Croatian is the big obvious one<br>
&lt;TabAtkins> ChrisL: either Cyrillic or Latin<br>
&lt;florian> q+<br>
&lt;TabAtkins> addison: they're generally tagged differently these days. pulling in the macro language puts the sh tag back, which you probably don't want<br>
&lt;astearns> ack castastrophe<br>
&lt;TabAtkins> castastrophe: i'm wondering if it's valuable to put together a table with languages that have common styling<br>
&lt;TabAtkins> castastrophe: if I'm writing common styles and want to cover the widest future use-case, my team might not support certain langs yet but may support in the future, I want to write styles generically that'll handle that well in the future<br>
&lt;dbaron> (I think there are other examples of languages written in different scripts in different places, although I'm not sure how each case is reflected in language tags...)<br>
&lt;TabAtkins> castastrophe: so if I have specific styles associated with those lang groups, it would be helpful for me to understand what we're talking about<br>
&lt;TabAtkins> castastrophe: so I know how to apply *this* CSS property to *this* group of langs...<br>
&lt;TabAtkins> castastrophe: versus like using a specific font-family to Serbian, might not want that for the whole group<br>
&lt;r12a> q+<br>
&lt;TabAtkins> castastrophe: so if it seems reasonable I could put together the langs we use internally, and people could comment what kind of styling would reasonably apply to those groups, maybe it would help?<br>
&lt;TabAtkins> astearns: might be good, as Emilio pointed out, to start with the HTML UA sheet<br>
&lt;astearns> ack jsahleen<br>
&lt;TabAtkins> jsahleen: is there any potential for danger in having CSS do it one way and JS do it another?<br>
&lt;TabAtkins> jsahleen: as Addison pointed out, there's an industry-standard way to do locale resolution and fallbacks, can you get into a case where your CSS is specifying something different than what your JS is working with?<br>
&lt;TabAtkins> castastrophe: jumping the queue, i've never personally had to do something lang-specific in JS. anyone have an example of that, besides what's useful for styling?<br>
&lt;TabAtkins> fantasai: I suspect there's some locale handling...<br>
&lt;TabAtkins> bramus: currency, datetime formatting, but there's already APIs for that<br>
&lt;TabAtkins> astearns: to Joel's point, if JS's i18n are using lang tags different from us, that might be...<br>
&lt;TabAtkins> fantasai: locale matching is different. you're not saying "is this a variant of X, so I can style it", you're saying "give me the closest thing to the lang that I really want"<br>
&lt;astearns> ack fantasai<br>
&lt;Zakim> fantasai, you wanted to ask about script subtags<br>
&lt;TabAtkins> fantasai: if I want "British English in the Hixie style", if you have it you'll give it to me, or you'll fall back to British English, failing that back to "plain" English that might bias towards American English, etc.<br>
&lt;castastrophe> How does AI automatic translations feed into this conversation? When the browser auto-translates, are they attaching new lang tags?<br>
&lt;TabAtkins> fantasai: you're taking a specific variant and finding, among the available options, the closest variant to that<br>
&lt;TabAtkins> fantasai: here we have a specific lang variant and are asking if it's close enough to this (possibly generic) request<br>
&lt;TabAtkins> addison: yeah you're doing a different kind of operation. you were describing lookup (per BCP), but CSS is doing filtering, to find all things that could match the specific range<br>
&lt;florian> q?<br>
&lt;astearns> ack dbaron<br>
&lt;TabAtkins> dbaron: we've been talking a bit about what the things in CSS that use langs are; I think we've been focusing mostly on :lang()<br>
&lt;TabAtkins> dbaron: and that's mostly fonts and typography stuff<br>
&lt;TabAtkins> dbaron: one other thing elsewhere in CSS is hyphenation support<br>
&lt;TabAtkins> dbaron: which I think is very different in what level of specificity it wants<br>
&lt;TabAtkins> dbaron: I think hyphenation generally wants the lang pretty specifically<br>
&lt;castastrophe> Re: dbaron ’s point here, I have attached color and padding logic to JA language groups; so not just typography<br>
&lt;TabAtkins> dbaron: unsure how the macro langs map to hyphenation distinctions [gives an example I can't transcribe]<br>
&lt;castastrophe> Some design system best practices do differ culturally as well<br>
&lt;TabAtkins> dbaron: but it seems for a lot of fonts things, the mechanism in CSS is language matching, but in many cases they usually want Script matching instead<br>
&lt;TabAtkins> dbaron: Like florian was saying "i'll write out zh, ja, ko" but really you want a script descriptor<br>
&lt;fantasai> (or three script descriptors really)<br>
&lt;TabAtkins> dbaron: so how much should we be designing lang-matching things for things where people should instead be using script matching<br>
&lt;astearns> ack fantasai<br>
&lt;addison> q+<br>
&lt;astearns> ack florian<br>
&lt;astearns> q+ fantasai<br>
&lt;astearns> ack fantasai<br>
&lt;TabAtkins> florian: the q is indeed about :lang(), but we do have several questions which might not have the same answer. but this q is :lang() specifically<br>
&lt;TabAtkins> florian: going back to Addison's point... you said you write a document and said "yue" you don't want to match with...<br>
&lt;astearns> zakim, close queue<br>
&lt;Zakim> ok, astearns, the speaker queue is closed<br>
&lt;TabAtkins> florian: but you don't write for matching when you write the document. you're describing. the person writing the styles is trying to match.<br>
&lt;TabAtkins> florian: the document author is trying to describe what they have and make themselves available for matching, the matcher can be specific or general depending on their need<br>
&lt;fantasai> +1<br>
&lt;TabAtkins> florian: so you want to be available to both types of matching<br>
&lt;TabAtkins> florian: so even if the macro-language family doesn't have strong typographic commonality, the person writing the selector can be specific if they need to. maybe some macro lang families you *always* need to be, but it's the style author's responsibility for that<br>
&lt;dbaron> s/[gives an example I can't transcribe]/ for example would Bokmål versus Nynorsk Norwegian use the same hyphenation dictionary or not/<br>
&lt;TabAtkins> florian: so I think this is right *for our use-case*, even if it's not right for all cases.<br>
&lt;TabAtkins> florian: and that's even within dbaron's point, we're talking about :lang() specifically, even if there are other style use-case that want something else<br>
&lt;astearns> ack r12a<br>
&lt;castastrophe> I mentioned that we have at times attached padding, spacing, and color palette updates as associated with JA lang<br>
&lt;TabAtkins> r12a: dbaron said much of what I wanted to say. if you're doing styling for voice browsers, you're more interested in the lang rather than the script. if you're doing typography you're more interested in the script, at least initially<br>
&lt;TabAtkins> r12a: so this answers cassandra's point. i'm constantly wanting to group by script rather than lang<br>
&lt;florian> q+<br>
&lt;TabAtkins> r12a: so there's definitely a difference there, an ambiguity<br>
&lt;TabAtkins> r12a: so limiting what we're talking about to specific controls like :lang() is probably important for our final decision<br>
&lt;astearns> ack addison<br>
&lt;castastrophe> The reason being that we have received design survey feedback from Japanese customers that they prefer tighter padding and brighter color palettes in their web design<br>
&lt;TabAtkins> addison: acknowledging david's point, there's things people use lang tags for like getting the right voice, or hyphenation<br>
&lt;TabAtkins> addison: regarding scripts, there is info in CLDR that tries to compute missing script subtags, might be worth looking at in the future<br>
&lt;TabAtkins> addison: for making :lang() better at matching script variation when the script isn't specific<br>
&lt;castastrophe> As I understand it, HTML == very specific to what is currently being rendered; CSS/JS == logical groupings of applying styles to the widest group of languages relevant<br>
&lt;TabAtkins> addison: so in general I think I agree with Florian's direction. different than how I came into the meeting.<br>
&lt;TabAtkins> addison: maybe this is largely a Chinese issue, the other macro langs are possibly a lot worse<br>
&lt;TabAtkins> addison: Chinese tends to be written pretty monolithically across all the langs, you can tag it as zh<br>
&lt;TabAtkins> addison: might be different from the other macro langs where you want different voice, different hyphenation, etc<br>
&lt;florian> q?<br>
&lt;TabAtkins> addison: word-breaking dictionaries, like in thai<br>
&lt;florian> q+<br>
&lt;TabAtkins> addison: where you don't want to force one language's resources onto a document just because it gets a specific thing<br>
&lt;TabAtkins> astearns: could we resolve to accept the edits in the spec, given today's convo, but have some new issues about the questions raised? whether this should only apply to Chinese, whether the filtering should apply elsewhere...<br>
&lt;TabAtkins> florian: that's what we're doing<br>
&lt;TabAtkins> astearns: ok. I'm happy with that outcome, but I'd be happy to hold off if anyone thinks it's premature<br>
&lt;TabAtkins> addison: maybe if y'all do a resolution we can review the text afterwards<br>
&lt;TabAtkins> florian: i'll point out the proposed text was accidentally merged already.<br>
&lt;TabAtkins> florian: so I think we shouldn't remove it. if more specific problems come along we should look at those<br>
&lt;TabAtkins> fantasai: i'm in favor of resolving that we take ex-lang canonicalization for :lang(), as currently specified<br>
&lt;TabAtkins> fantasai: even for langs that are less homogenous, if author requests that generic matching we should respect it<br>
&lt;TabAtkins> fantasai: if there are specific problems we can deal with that<br>
&lt;TabAtkins> astearns: so proposed resolution is we accept the edits for this issue<br>
&lt;TabAtkins> RESOLVED: Accept already-merged edits<br>
&lt;TabAtkins> astearns: if i18n has the time to look at this text with this convo in mind, would be appreciated<br>
&lt;TabAtkins> fantasai: Addison, you mentioned this canonicalization sometimes adds subtags<br>
&lt;TabAtkins> addison: like "yue" becomes "zh-yue"<br>
&lt;TabAtkins> addison: CLDR has an "add likely subtags" mechanism that will correct "zh-hk" to add a script subtag, but what you have doesn't do that<br>
&lt;TabAtkins> fantasai: ok. we should look into that at some point<br>
&lt;TabAtkins> ChrisL: while i18n people are here...<br>
&lt;TabAtkins> ChrisL: I just looked through Fonts 4 about scripts, there's nothing<br>
&lt;TabAtkins> ChrisL: OpenType has script tags, we barely mention them.<br>
&lt;TabAtkins> ChrisL: it looks like we have a gap in fonts, doing specific things for some scripts regardless of lang<br>
&lt;TabAtkins> florian: yes, script matching (separate from lang matching) is a whole nother ballgame<br>
&lt;TabAtkins> florian: might need a pseudo-element rather than pseudo-class, etc<br>
&lt;TabAtkins> ChrisL: okay, i'll need help with that<br>
&lt;TabAtkins> addison: you guys do extended filtering, right? can do "*-cyrl"<br>
&lt;TabAtkins> florian: yeah, OpenType stuff is another issue, interesting but different<br>
&lt;TabAtkins> [Chris volunteers to raise an issue]<br>
&lt;ChrisL> s/Chris/Chris and Florian/<br>
</details>
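
For readers skimming the log, the matching behavior florian argues for (canonicalize both the selector's range and the element's tag to the extended-language form, then prefix-match) can be sketched roughly as below. This is a minimal illustration, not the spec text or any browser's implementation. The `EXTLANG_PREFIX` table and the function names `to_extlang_form` and `lang_matches` are hypothetical, the table is a two-entry excerpt (a real implementation would presumably read the IANA Language Subtag Registry), and the comparison is plain subtag-prefix matching without the wildcard handling of extended filtering.

```python
# Hypothetical two-entry excerpt: extlang subtag -> its primary-language prefix.
# A real implementation would load this from the IANA Language Subtag Registry.
EXTLANG_PREFIX = {"yue": "zh", "cmn": "zh"}


def to_extlang_form(tag: str) -> str:
    """Rewrite a tag such as "yue" into its extended form "zh-yue".

    Tags that already start with the primary language ("zh-yue") or that are
    not in the table ("en-GB") are only lowercased.
    """
    subtags = tag.lower().split("-")
    prefix = EXTLANG_PREFIX.get(subtags[0])
    if prefix is not None:
        subtags.insert(0, prefix)
    return "-".join(subtags)


def lang_matches(selector_range: str, element_lang: str) -> bool:
    """Return True if :lang(selector_range) would match element_lang.

    Both sides are put into extended form, then the selector's subtags must
    be a prefix of the element's subtags.
    """
    want = to_extlang_form(selector_range).split("-")
    have = to_extlang_form(element_lang).split("-")
    return have[:len(want)] == want


if __name__ == "__main__":
    print(lang_matches("zh", "yue"))      # generic selector, specific document
    print(lang_matches("yue", "zh-yue"))  # two spellings of the same language
    print(lang_matches("yue", "zh"))      # specific selector, generic document
```

Under these assumptions, `:lang(zh)` matches a document tagged `yue`, `:lang(yue)` and `:lang(zh-yue)` match each other's spellings, and a specific selector does not match a document tagged with only the generic `zh`, which is the asymmetry discussed in the log.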


-- 
GitHub Notification of comment by css-meeting-bot
Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/4154#issuecomment-3818950415 using your GitHub account


-- 
Sent via github-notify-ml as configured in https://github.com/w3c/github-notify-ml-config

Received on Thursday, 29 January 2026 16:52:29 UTC