Re: [csswg-drafts] Consider Canonicalization of language tags in :lang() selector matching (#4154)

The CSS Working Group just discussed `canonicalization of :lang() selectors`.

<details><summary>The full IRC log of that discussion</summary>
&lt;heycam> Topic: canonicalization of :lang() selectors<br>
&lt;heycam> github: https://github.com/w3c/csswg-drafts/issues/4154<br>
&lt;heycam> florian: the :lang selector lets you select pieces of the DOM for styling based on the language<br>
&lt;heycam> ... it's already somewhat smart, since lang tags are structured<br>
&lt;heycam> ... selecting zh, and the document saying zh-Hant, it will do the right thing and match it<br>
&lt;heycam> ... that logic is already built in<br>
&lt;heycam> ... the IANA maintains a registry of the languages that exist and what they mean<br>
&lt;heycam> ... tags and subtags<br>
&lt;heycam> ... and in addition to just listing them, there is logic in that registry. some languages are a deprecated version of some other languages<br>
&lt;heycam> ... Cantonese used to be zh-yue. that is deprecated and replaced with yue<br>
&lt;heycam> ... the lang selector does not take that logic into account<br>
&lt;heycam> ... so if you have a document marked as lang="yue", and you are matching :lang(zh) or :lang(zh-yue), it won't match<br>
&lt;heycam> ... we may want to use the registry definitions of how to match<br>
&lt;heycam> ... I propose we do that<br>
&lt;heycam> addison: some tag canonicalization is defined by BCP 47 to consume some of the information in the registry<br>
&lt;heycam> ... you've been corresponding on the IETF languages list and I think some of your questions have been about handling macro-languages -- zh-yue is a macro language<br>
&lt;heycam> florian: zh-yue is a macro language, zh is a macro language<br>
&lt;heycam> addison: there's a separate thing. previous to the current BCP 47, there was a mechanism for registering whole tags<br>
&lt;heycam> ... that's grandfathered now<br>
&lt;heycam> ... some of them match subtags, some don't<br>
&lt;heycam> ... [...] is replaced by xtg<br>
&lt;heycam> addison: ignoring grandfathered tags, they all map to something.  the ones you're referring to are structurally identical, the tags are composed of subtags<br>
&lt;heycam> ... like zh-yue<br>
&lt;heycam> florian: the way I'm looking at this, there are a variety of reasons for why certain languages might be the same<br>
&lt;heycam> ... there is a defined canonicalization that handles some of them<br>
&lt;heycam> addison: for the BCP 47 canonicalization, that will do away with the grandfathered ones and other structural weirdness<br>
&lt;heycam> florian: it won't deal with the two types of norwegian<br>
&lt;heycam> ... this is a complicated topic with many weird variants<br>
&lt;heycam> addison: there's a subset there that's well defined<br>
&lt;heycam> ... there's a second set of rules, which are in CLDR<br>
&lt;heycam> ... UTF 35<br>
&lt;AmeliaBR> s/UTF/UTR/<br>
&lt;heycam> ... for handling some additional cases around Chinese, where you have different script subtags that you want to appear or not in some circumstances<br>
&lt;heycam> ... some of those may be of interest, but it's more complicated<br>
&lt;heycam> ... I don't want to pretend that doesn't exist, but they do<br>
&lt;heycam> florian: if you have a link, please drop it<br>
&lt;heycam> addison: defining matching, if you're just using BCP 47 "lookup" IINM<br>
&lt;heycam> florian: extended filtering<br>
&lt;heycam> ... the text for extended filtering says you should canonicalize<br>
&lt;heycam> addison: yes you should<br>
&lt;heycam> florian: thanks for bringing up that the topic is broader<br>
&lt;heycam> addison: if you do the minimum set, it'll make it the most predictable.  the other aspects are worth studying<br>
&lt;heycam> ... there are some annoying corner cases in Chinese<br>
&lt;heycam> florian: I hear support for the current proposal, and complicated problems to think about in addition to that<br>
&lt;heycam> addison: yes I agree with your current proposal and then do further study, and track the other standards happening in that space<br>
&lt;heycam> florian: there is a PR for this<br>
&lt;heycam> addison: should we review that?<br>
&lt;fantasai> https://github.com/frivoal/csswg-drafts/commit/3cff5d844b6415ef30d3e2dac221f9479e0ec7aa<br>
&lt;heycam> florian: if you haven't I suggest you do<br>
&lt;heycam> AmeliaBR: the other question on the topic, do we have implementor commitments?<br>
&lt;heycam> r12a: the current text I'm looking at says "... must be converted to x-lang form"<br>
&lt;heycam> ... that's a slightly different discussion from what you canonicalize it as<br>
&lt;heycam> ... zh-yue would become yue<br>
&lt;heycam> florian: I had that discussion on the list as well<br>
&lt;heycam> ... this is the right direction<br>
&lt;heycam> ... zh doesn't match yue.  so if you canonicalize both to x-lang format, it'll match<br>
&lt;heycam> florian: I raised this on the mailing list, and they agreed it was the right form to canonicalize it to<br>
&lt;heycam> addison: some people on the list did<br>
&lt;heycam> ... the challenge is that this will bring you more promiscuous matching than the author may have intended<br>
&lt;heycam> ... it'll make Cantonese match Mandarin Chinese in some cases<br>
&lt;heycam> florian: if you want to match Mandarin specifically that's also possible<br>
&lt;heycam> addison: normally Mandarin is tagged just as zh<br>
&lt;heycam> r12a: for all the macro languages there's usually a preferred language<br>
&lt;heycam> fantasai: if the author cares that much, they can put the information there<br>
&lt;Rossen__> q?<br>
&lt;heycam> addison: that's right<br>
&lt;duerst> q+<br>
&lt;heycam> ... you don't want to have them with a correctly tagged document, have the :lang match things they were [...]<br>
&lt;xfq> ack du<br>
&lt;heycam> duerst: that mailing list is no longer a WG<br>
&lt;addison> http://www.unicode.org/reports/tr35/#Canonical_Unicode_Locale_Identifiers<br>
&lt;heycam> ... so people can give you opinions and background knowledge, but no formal resolutions<br>
&lt;AmeliaBR> So, two cases: (A) author used zh in stylesheet and yue in HTML; doesn't expect a match. (B) author used zh in stylesheet and zh-yue in HTML; does expect a match. Canonicalizing both yue and zh-yue to the same value will break one or the other.<br>
&lt;Rossen__> q?<br>
&lt;heycam> florian: I agree that the problem can exist in both directions, too much or not enough, I think since we're doing it for typographical purposes, and the languages are related, most of the time if you have zh styles you want it to match Cantonese too<br>
&lt;addison> http://www.unicode.org/reports/tr35/#Likely_Subtags<br>
&lt;heycam> ... it's possible to style Mandarin differently from Cantonese, Hakka, etc., but that's rare<br>
&lt;heycam> r12a: it's not just Chinese we're talking about<br>
&lt;heycam> ... there are other languages that have much more differentiation between the language depending on which of the subtags you choose<br>
&lt;AmeliaBR> q+ to suggest that this is better dealt with in the user agent stylesheet<br>
&lt;heycam> ... the point I wanted to make was that we said that let's go ahead with the proposal at the moment<br>
&lt;heycam> ... looking at the issue, there was a proposal you wrote, I responded saying you had to modify that<br>
&lt;heycam> ... the PR doesn't say much<br>
&lt;heycam> ... not sure what the exact proposal is<br>
&lt;heycam> ... I think this information we're talking about now should also be part of that<br>
&lt;heycam> florian: the earlier proposal that you rightfully pointed out I wrote too much, including making zh-HK match yue and things like this, that's not defined in the repo I'm referring to<br>
&lt;heycam> ... I'm just saying, just the canonicalization to x-lang form as defined by BCP 47<br>
&lt;heycam> ... and as supported by the mailing list that used to be the WG defining that document<br>
&lt;heycam> ... but whichever way we go, including no change at all, has a risk of mismatching things in some cases<br>
&lt;heycam> addison: not all tags match all values, otherwise what's the point<br>
&lt;dbaron> s/WG defining/WG that used to define/<br>
&lt;heycam> ... the problem is to arrive at something that authors understand how to get the results they want<br>
&lt;heycam> ... we'll make some compromises, the question is which ones<br>
&lt;heycam> fantasai: based on the conversation so far, it seems like I don't think canonicalizing yue to zh-yue is going to be good. either we don't canonicalize, or in a direction where zh encompasses Cantonese<br>
&lt;heycam> ... I am sure there are style sheets that just use :lang(zh), and they'll expect it to match<br>
&lt;heycam> addison: the other possibility is that the inclusion or non-inclusion of the enclosing subtag -- in this case zh -- is a choice the author is making deliberately. if they've made that choice deliberately, if we mess with their tags when doing matching it may produce results they don't expect<br>
&lt;heycam> ... most of the matching algorithms are strict "remove from right" subtag matching<br>
&lt;heycam> ... to make it obvious what's happening<br>
&lt;heycam> ... once you start adding or subtracting subtags in ways other than the deprecation/renaming, I think that has more risk to it in your space<br>
&lt;heycam> ... since it's not obvious what's going to happen<br>
&lt;heycam> ... I would support doing the mappings that are in the registry, since that's where if you have multiple variations, because people have older documents and style sheets, they'll get the right answer<br>
&lt;heycam> ... that's different than adding or subtracting subtags<br>
&lt;xfq> ack Ame<br>
&lt;Zakim> AmeliaBR, you wanted to suggest that this is better dealt with in the user agent stylesheet<br>
&lt;heycam> AmeliaBR: we covered a lot of what I was going to say, but with a different conclusion<br>
&lt;heycam> ... it's important that when matching a style sheet and a document that we respect the way that the author matched it, don't want to introduce spurious matching from canonicalization<br>
&lt;heycam> ... also don't want to break matching<br>
&lt;heycam> ... from the examples brought up, it's obvious that any canonicalization may end up breaking one site or the other<br>
&lt;heycam> ... the question is then how do we make it easier in the general case for having new style sheets or new UA style rules deal with all these deprecated synonyms<br>
&lt;heycam> ... at the UA style sheet, that can just be an advice to UAs to look up the BCP deprecation list<br>
&lt;heycam> ... then also include the deprecated synonyms<br>
&lt;heycam> ... that doesn't work for things like a style sheet that is coming from a library or CSS reset<br>
&lt;heycam> ... or the case of newer code, writing a new style sheet, but still apply to the old pages with the older language tags<br>
&lt;heycam> ... one approach that might address that use case is something like what we do with case insensitive selector matching<br>
&lt;heycam> ... a flag in the selector that means "this value or any synonyms"<br>
&lt;heycam> florian: so an opt in for canonicalization<br>
&lt;heycam> addison: there are three sets<br>
&lt;heycam> ... the grandfathered list is permanently fixed and has been for 10 years<br>
&lt;heycam> ... all those tags have explicit mappings, you can safely map them to modern equivalents or vv<br>
&lt;heycam> addison: individual subtags that have mappings, it's mostly about countries going out of business<br>
&lt;heycam> ... yiddish has two subtags, hebrew has two subtags, there's a canonical one<br>
&lt;heycam> ... the third thing is the x-lang thing, which is inconvenient<br>
&lt;heycam> ... because there's two ways to say things.  with or without the enclosing subtag<br>
&lt;heycam> ... the canonicalization rule in BCP 47 says you can drop the primary language subtag and use the x-lang by itself<br>
&lt;heycam> ... it's permissible for implementations to do that<br>
&lt;heycam> ... I don't recall it says you can put it back<br>
&lt;heycam> florian: there are 2 sets of rules<br>
&lt;heycam> ... one that just strips it off.  the other says when you're done stripping it off, put it back<br>
&lt;heycam> r12a: it says you could consider doing that<br>
&lt;heycam> addison: the first two are completely safe<br>
&lt;heycam> ... you want to do those<br>
&lt;heycam> ... for interop<br>
&lt;heycam> ... the x-lang thing, I think you can choose<br>
&lt;heycam> ... whether to put the enclosing subtag on<br>
&lt;heycam> ... the challenge is that Chinese you'd want to do that, but some of the other macro languages are not as crisp.  Arabic is one of these, Malaysian<br>
&lt;r12a> https://r12a.github.io/app-subtags/<br>
&lt;heycam> r12a: Omani Arabic and Moroccan Arabic, which treat certain things differently, may have different font requirements<br>
&lt;heycam> ...  but they both resolve to "ar" if we follow this PR<br>
&lt;heycam> ... but that's used for standard Arabic<br>
&lt;heycam> florian: I think we're not ready to merge the PR<br>
&lt;heycam> ... action items: the safe subset of canonicalization, I don't think it's defined as a canonicalizing operation separately from the x-lang thing<br>
&lt;heycam> ... action on me to find out if we can<br>
&lt;heycam> addison: this is an area that probably deserves better documentation from us<br>
&lt;heycam> ... we can go offline and make sure we get the right answer<br>
&lt;heycam> ... we can go back and talk to the locale folks at Unicode and the languages list and make sure we're capturing the sense of this<br>
&lt;heycam> florian: one, figure out if the safe subset exists as a standard operation<br>
&lt;heycam> ... two, if we do what I'm proposing, look at the affected languages and see if it's good for them<br>
</details>
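
The trade-off in the log above (AmeliaBR's cases A and B, and the "strip the prefix" versus "put it back" rules florian and addison describe) may be easier to follow with a small sketch. This is not the text of the PR or of any specification: the function names are hypothetical, the one-entry `EXTLANG_PREFIX` table is an illustrative excerpt standing in for the IANA Language Subtag Registry, and `prefix_match` is a simplified "remove from right" prefix comparison rather than full BCP 47 extended filtering.

```python
from typing import Callable, Optional

# Illustrative excerpt only (assumption): the real data lives in the
# IANA Language Subtag Registry. Maps an extlang subtag to its prefix.
EXTLANG_PREFIX = {"yue": "zh"}


def strip_extlang_prefix(tag: str) -> str:
    """Drop the enclosing primary language subtag: 'zh-yue' -> 'yue'."""
    subtags = tag.lower().split("-")
    if len(subtags) >= 2 and EXTLANG_PREFIX.get(subtags[1]) == subtags[0]:
        subtags = subtags[1:]
    return "-".join(subtags)


def add_extlang_prefix(tag: str) -> str:
    """Strip first, then put the enclosing subtag back: 'yue' -> 'zh-yue'."""
    subtags = strip_extlang_prefix(tag).split("-")
    prefix = EXTLANG_PREFIX.get(subtags[0])
    if prefix is not None:
        subtags = [prefix] + subtags
    return "-".join(subtags)


def prefix_match(lang_range: str, tag: str) -> bool:
    """Simplified matching: the range equals the tag or is a prefix of it
    ending on a subtag boundary (case-insensitive)."""
    r, t = lang_range.lower(), tag.lower()
    return t == r or t.startswith(r + "-")


def matches(lang_range: str, tag: str,
            canonicalize: Optional[Callable[[str], str]] = None) -> bool:
    """Match a :lang() argument against a document language tag, optionally
    canonicalizing both sides the same way first."""
    if canonicalize is not None:
        lang_range, tag = canonicalize(lang_range), canonicalize(tag)
    return prefix_match(lang_range, tag)


if __name__ == "__main__":
    for name, canon in [("no canonicalization", None),
                        ("strip prefix", strip_extlang_prefix),
                        ("put prefix back", add_extlang_prefix)]:
        a = matches("zh", "yue", canon)      # case (A)
        b = matches("zh", "zh-yue", canon)   # case (B)
        print(f"{name}: :lang(zh) vs yue={a}, :lang(zh) vs zh-yue={b}")
```

Under these assumptions, only the unmodified behaviour gives both authors what they expected (no match for A, a match for B). Stripping the prefix breaks case B, and putting the prefix back makes case A match, which is the "one or the other" outcome noted in the discussion.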


-- 
GitHub Notification of comment by css-meeting-bot
Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/4154#issuecomment-532055717 using your GitHub account

Received on Tuesday, 17 September 2019 04:40:50 UTC