- From: Dael Jackson <daelcss@gmail.com>
- Date: Tue, 3 Dec 2019 18:14:20 -0500
- To: www-style@w3.org
- Cc: www-international@w3.org
=========================================
These are the official CSSWG minutes.
Unless you're correcting the minutes,
Please respond by starting a new thread
with an appropriate subject line.
=========================================


Joint Meeting with Internationalization
+++++++++++++++++++++++++++++++++++++++

Selectors
---------

  - There is complex interaction between the different historic sets of :lang() tags and subtags which makes resolving issue #4154 (Canonicalization of :lang() selectors) complex. The PR doesn't capture all the complexity so florian will work with addison to see if a safe subset can be defined. florian will also look at the impacts of his proposal on various languages to ensure it's safe.

CSS Text
--------

  - RESOLVED: The presence of soft break opportunities between spans which change soft breaking rules is undefined (Issue #3897: Breaking Rules at inline element boundaries)
  - The i18n group will look further into issue #3481 (Remove collapsible line breaks adjacent to word separators) especially around cases such as the Ogham space mark.

===== FULL MINUTES BELOW =======

Agenda: https://wiki.csswg.org/planning/tpac-2019#agenda

Scribe: heycam

Joint Meeting with Internationalization
+++++++++++++++++++++++++++++++++++++++

Selectors
=========

Canonicalization of :lang() selectors
-------------------------------------
github: https://github.com/w3c/csswg-drafts/issues/4154

florian: The :lang selector lets you select pieces of the DOM for styling based on the language
florian: It's already somewhat smart, since lang tags are structured
florian: Selecting zh, and the document saying zh-Hant, it will do the right thing and match it
florian: that logic is already built in
florian: The IANA maintains a registry of the languages that exist and what they mean
florian: tags and subtags
florian: and in addition to just listing them, there is logic in that registry.
    Some languages are a deprecated version of some other languages
florian: Cantonese used to be zh-yue. That is deprecated and replaced with yue
florian: The lang selector does not take that logic into account
florian: So if you have a document marked as lang="yue", and you are matching :lang(zh) or :lang(zh-yue), it won't match
florian: We may want to use the registry definitions of how to match
florian: I propose we do that
addison: Some tag canonicalization is defined by BCP 47 to consume some of the information in the registry
addison: You've been corresponding on the IETF languages list and I think some of your questions have been about handling macro-languages -- zh-yue is a macro language
florian: zh-yue is a macro language, zh is a macro language
addison: There's a separate thing. Previous to the current BCP 47, there was a mechanism for registering whole tags
addison: that's grandfathered now
addison: Some of them match subtags, some don't
addison: [...] is replaced by xtg
addison: Ignoring grandfathered tags, they all map to something.
    The ones you're referring to are structurally identical, the tags are composed of subtags
addison: like zh-yue
florian: The way I'm looking at this, there are a variety of reasons for why certain languages might be the same
florian: there is a defined canonicalization that handles some of them
addison: For the BCP 47 canonicalization, that will do away with the grandfathered ones and other structural weirdness
florian: It won't deal with the two types of Norwegian
florian: This is a complicated topic with many weird variants
addison: There's a subset there that's well defined
addison: There's a second set of rules, which are in CLDR
addison: UTR 35
addison: for handling some additional cases around Chinese, where you have different script subtags that you want to appear or not in some circumstances
addison: some of those may be of interest, but it's more complicated
addison: I don't want to pretend they don't exist, because they do
florian: If you have a link, please drop it
addison: Defining matching, if you're just using BCP 47 "lookup" IINM
florian: Extended filtering
florian: the text for extended filtering says you should canonicalize
addison: Yes you should
florian: Thanks for bringing up that the topic is broader
addison: If you do the minimum set, it'll make it the most predictable. the other aspects are worth studying
addison: there are some annoying corner cases in Chinese
florian: I hear support for the current proposal, and complicated problems to think about in addition to that
addison: Yes I agree with your current proposal and then do further study, and track the other standards happening in that space
florian: There is a PR for this
addison: Should we review that?
<fantasai> https://github.com/frivoal/csswg-drafts/commit/3cff5d844b6415ef30d3e2dac221f9479e0ec7aa
florian: If you haven't I suggest you do
AmeliaBR: The other question on the topic, do we have implementor commitments?
r12a: The current text I'm looking at says "...
    must be converted to x-lang form"
r12a: that's a slightly different discussion from what you canonicalize it as
r12a: zh-yue would become yue
florian: I had that discussion on the list as well
florian: This is the right direction
florian: zh doesn't match yue so if you canonicalize both to x-lang format, it'll match
florian: I raised this on the mailing list, and they agreed it was the right form to canonicalize it to
addison: Some people on the list did
addison: The challenge is that this will bring you more promiscuous matching than the author may have intended
addison: It'll make Cantonese match Mandarin Chinese in some cases
florian: If you want to match Mandarin specifically that's also possible
addison: Normally Mandarin is tagged just as zh
r12a: For all the macro languages there's usually a preferred language
fantasai: If the author cares that much, they can put the information there
addison: That's right
addison: you don't want to have them with a correctly tagged document, have the :lang match things they were [...]
<addison> http://www.unicode.org/reports/tr35/#Canonical_Unicode_Locale_Identifiers
duerst: That mailing list is no longer a WG
duerst: so people can give you opinions and background knowledge, but no formal resolutions
<AmeliaBR> So, two cases: (A) author used zh in stylesheet and yue in HTML; doesn't expect a match. (B) author used zh in stylesheet and zh-yue in HTML; does expect a match. Canonicalizing both yue and zh-yue to the same value will break one or the other.
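
The two cases above can be made concrete with a minimal Python sketch. It is an illustration only, under stated assumptions: it uses simplified prefix matching rather than the full extended filtering of RFC 4647, and `EXTLANG_PREFIX` is a hypothetical three-entry excerpt of the registry's extlang data. All function names are invented for this example.

```python
# Assumption: a tiny excerpt of the extlang -> macrolanguage prefix data.
EXTLANG_PREFIX = {"yue": "zh", "cmn": "zh", "hak": "zh"}


def to_extlang_form(tag: str) -> str:
    """Put the enclosing subtag back, e.g. 'yue' -> 'zh-yue'."""
    subtags = tag.lower().split("-")
    prefix = EXTLANG_PREFIX.get(subtags[0])
    if prefix is not None:
        subtags.insert(0, prefix)
    return "-".join(subtags)


def to_canonical_form(tag: str) -> str:
    """Strip the enclosing subtag, e.g. 'zh-yue' -> 'yue'."""
    subtags = tag.lower().split("-")
    if len(subtags) > 1 and EXTLANG_PREFIX.get(subtags[1]) == subtags[0]:
        subtags.pop(0)
    return "-".join(subtags)


def matches(lang_range: str, tag: str) -> bool:
    """Simplified matching: the range equals the tag or is a subtag prefix of it."""
    r, t = lang_range.lower(), tag.lower()
    return t == r or t.startswith(r + "-")


# No canonicalization: both cases behave as the author expects.
case_a = matches("zh", "yue")      # False, no match expected
case_b = matches("zh", "zh-yue")   # True, match expected

# Canonicalizing both sides to extlang form makes case (A) match.
case_a_extlang = matches(to_extlang_form("zh"), to_extlang_form("yue"))

# Canonicalizing both sides to the stripped form makes case (B) stop matching.
case_b_stripped = matches(to_canonical_form("zh"), to_canonical_form("zh-yue"))
```

Whichever direction is chosen, one of the two cases changes behavior, which is the trade-off being discussed.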
florian: I agree that the problem can exist in both directions, too much or not enough, I think since we're doing it for typographical purposes, and the languages are related, most of the time if you have zh styles you want it to match Cantonese too
florian: It's possible to style Mandarin differently from Cantonese, Hakka, etc., but that's rare
<addison> http://www.unicode.org/reports/tr35/#Likely_Subtags
r12a: It's not just Chinese we're talking about
r12a: There are other languages that have much more differentiation between the language depending on which of the subtags you choose
r12a: The point I wanted to make was that we said that let's go ahead with the proposal at the moment
r12a: Looking at the issue, there was a proposal you wrote, I responded saying you had to modify that
r12a: the PR doesn't say much
r12a: not sure what the exact proposal is
r12a: I think this information we're talking about now should also be part of that
florian: In the earlier proposal, as you rightfully pointed out, I wrote too much, including making zh-HK match yue and things like this; that's not defined in the repo I'm referring to
florian: I'm just saying, just the canonicalization to x-lang form as defined by BCP 47
florian: and as supported by the mailing list that used to be the WG that used to define that document
florian: but whichever way we go, including no change at all, has a risk of mismatching things in some cases
addison: Not all tags match all values, otherwise what's the point
addison: The problem is to arrive at something that authors understand how to get the results they want
addison: we'll make some compromises, the question is which ones
fantasai: Based on the conversation so far, it seems like I don't think canonicalizing yue to zh-yue is going to be good.
    Either we don't canonicalize, or in a direction where zh encompasses Cantonese
fantasai: I am sure there are style sheets that just use :lang(zh), and they'll expect it to match
addison: The other possibility is that the inclusion or non-inclusion of the enclosing subtag -- in this case zh -- is a choice the author is making deliberately. if they've made that choice deliberately, if we mess with their tags when doing matching it may produce results they don't expect
addison: Most of the matching algorithms are strict "remove from right" subtag matching
addison: to make it obvious what's happening
addison: Once you start adding or subtracting subtags in ways other than the deprecation/renaming, I think that has more risk to it in your space
addison: since it's not obvious what's going to happen
addison: I would support doing the mappings that are in the registry, since that's where if you have multiple variations, because people have older documents and style sheets, they'll get the right answer
addison: That's different than adding or subtracting subtags
AmeliaBR: We covered a lot of what I was going to say, but with a different conclusion
AmeliaBR: It's important that when matching a style sheet and a document that we respect the way that the author matched it, don't want to introduce spurious matching from canonicalization
AmeliaBR: also don't want to break matching
AmeliaBR: From the examples brought up, it's obvious that any canonicalization may end up breaking one site or the other
AmeliaBR: The question is then how do we make it easier in the general case for having new style sheets or new UA style rules deal with all these deprecated synonyms
AmeliaBR: At the UA style sheet, that can just be advice to UAs to look up the BCP deprecation list
AmeliaBR: then also include the deprecated synonyms
AmeliaBR: That doesn't work for things like a style sheet that is coming from a library or CSS reset
AmeliaBR: or the case of newer code, writing a new style
    sheet, but still apply to the old pages with the older language tags
AmeliaBR: One approach that might address that use case is something like what we do with case insensitive selector matching
AmeliaBR: a flag in the selector that means "this value or any synonyms"
florian: So an opt in for canonicalization
addison: There are three sets
addison: the grandfathered list is permanently fixed and has been for 10 years
addison: all those tags have explicit mappings, you can safely map them to modern equivalents or vice versa
addison: Individual subtags that have mappings, it's mostly about countries going out of business
addison: Yiddish has two subtags, Hebrew has two subtags, there's a canonical one
addison: The third thing is the x-lang thing, which is inconvenient
addison: because there's two ways to say things. With or without the enclosing subtag
addison: The canonicalization rule in BCP 47 says you can drop the primary language subtag and use the x-lang by itself
addison: it's permissible for implementations to do that
addison: I don't recall that it says you can put it back
florian: There are 2 sets of rules
florian: one that just strips it off. The other says when you're done stripping it off, put it back
r12a: It says you could consider doing that
addison: The first two are completely safe
r12a: You want to do those
r12a: for interop
r12a: The x-lang thing, I think you can choose
r12a: whether to put the enclosing subtag on
r12a: The challenge is that for Chinese you'd want to do that, but some of the other macro languages are not as crisp.
    Arabic is one of these, Malaysian
<r12a> https://r12a.github.io/app-subtags/
r12a: Omani Arabic and Moroccan Arabic, which treat certain things differently, may have different font requirements
r12a: but they both resolve to "ar" if we follow this PR
r12a: but that's used for standard Arabic
<myles> thanks for the link, r12a is the best
<fantasai> +1
florian: I think we're not ready to merge the PR
florian: Action items: the safe subset of canonicalization, I don't think it's defined as a canonicalizing operation separately from the x-lang thing
florian: Action on me to find out if we can
addison: This is an area that probably deserves better documentation from us
addison: We can go offline and make sure we get the right answer
addison: We can go back and talk to the locale folks at Unicode and the languages list and make sure we're capturing the sense of this
florian: One, figure out if the safe subset exists as a standard operation
florian: Two, if we do what I'm proposing, look at the affected languages and see if it's good for them

CSS Text 3
==========

Breaking Rules at inline element boundaries
-------------------------------------------
github: https://github.com/w3c/csswg-drafts/issues/3897

fantasai: There was an issue raised about what happens when you have two inline elements that have different breaking rules
fantasai: 3 properties control this.
    white-space, word-break, line-break
fantasai: Looking at an example (in the GitHub issue)
fantasai: at the boundaries of the span, which line breaking rule applies when it has a different word-break prop value to the rest of the text
fantasai: for white-space, the nearest common ancestor is used
fantasai: The complication for word-break and line-break is that determining where you're allowed to break requires running an analysis on a lot of text
fantasai: and every time that value changes you have to do another run, so impl wise it's a bit awkward
fantasai: There's been some discussion about what's the best behavior here
fantasai: I wanted to ask i18n if you have feedback on this issue
fantasai: and ask the WG about this proposal to leave this undefined for L3, to give impls time to experiment
fantasai: Doesn't seem to be a terribly high importance case to solve at the moment
florian: I think one of the more interesting cases -- and I support making it undefined -- is if the parent div allows a break between every letter, and two spans next to each other which don't
florian: Can you break between the spans or not
florian: Current spec says yes, but it's hard-ish to implement
florian: If we need time to think about this, undefine it for a while, seems reasonable
florian: but that's the kind of case this brings to the surface
Rossen: Any objections to leaving it undefined?
r12a: we should look at it as a group offline
r12a: It's quite a long thread. I seem to remember someone brought up an example that didn't work
nmccully: Are there layout engines working on this that would benefit from the extra time?
fantasai: Part of the issue is the ICU APIs make it awkward for the rules to change in the middle of the line
fantasai: so impl wise it's awkward
fantasai: Could be factorial if you're changing it every other letter in the line
fantasai: so there's some hesitancy to impl that given the current infrastructure
fantasai: but there doesn't seem to be great solutions
fantasai: Some of the behaviors you'd get from doing an easy thing would be non-symmetric
fantasai: you'd be switching slightly less if you use the current rule in the spec, but that's all
fantasai: There's not a high pressure to solve this and get interop
fantasai: Look at it again in L4
myles: Tangential comment, the general thing we're discussing is styling element boundaries
myles: This is something letter-spacing also does
myles: The spec says something that all browsers disagree with
myles: If we do come up with a good way to describe boundary behavior, we should try to use this system to describe letter-spacing too
fantasai: I think the spec is right on letter-spacing
nigel: I think it would be good to have a general way to handle this
florian: we have a current generalized rule, that is general, and does the right thing, and is painful to impl

RESOLVED: It's undefined

<myles> the presence of soft break opportunities between spans which change soft breaking opportunities is undefined
heycam RESOLVED: i.e., the presence of soft break opportunities between spans which change soft breaking opportunities is undefined

Remove collapsible line breaks adjacent to word separators
----------------------------------------------------------
github: https://github.com/w3c/csswg-drafts/issues/3481

fantasai: We generally have this concept in CSS and HTML that you can use white space to format your source, and we collapse white space, including line breaks, down to a single space
fantasai: essentially unbreaking the source lines to create a paragraph
fantasai: For Chinese and Japanese which don't use spaces, we
    have some rules to remove the space; otherwise you will be forced to put entire paragraphs on one line always
fantasai: There are some rules for doing that based on character classes
fantasai: What we didn't consider thoroughly is languages that use a word separator that's not a space
fantasai: We do special case ZWSP, for Thai and other languages
fantasai: but we don't have something similar for Ethiopic word space
fantasai: Probably don't also want a regular space added there
fantasai: Proposal is when there's a word separator character adjacent to a line break, the line break just goes away
fantasai: I think the characters that are affected here are Ogham space mark and Ethiopic word space and the Tibetan tsek
<koji> https://drafts.csswg.org/css-text-3/#word-separator
AmeliaBR: Does this map to something in Unicode? or do we need to maintain this list?
r12a: I think there is something, not sure if it's fit for this purpose
r12a: archaic scripts have other examples
fantasai: [reads definition in the spec right now for word-spacing]
florian: We need to maintain a list
myles: Let's ask Unicode to do it
myles: If there is such a facility for these character lists, hard to believe it's specific for the web platform
myles: and not needed in text editors for example
myles: I don't think the web specs should maintain this list
florian: I agree with part of your statement, should try to work this out with Unicode
florian: This one specifically maybe, but some are specifically web platform related
florian: since this is relevant to turning HTML markup into text
myles: There are many different markup languages...
fantasai: There are 2 questions
fantasai: if we want to do this, and then whether we maintain the list or if Unicode should
addison: I think we want to do some research
addison: space or no space is a classic problem
addison: I would be surprised if there weren't something, but don't know off the top of my head
addison: would be happy to engage
myles: If this is a classical problem, it's been solved, and we should figure out how it's been solved in the past and re-use that solution
fantasai: looking at some of the stuff in css-text, we have a concept of word separators
fantasai: and it includes a set of code points
fantasai: It excludes Ogham space mark
fantasai: since it would cause text to not join any more [word-spacing has different considerations than white space collapsing]
fantasai: So general usage in Unicode is text processing segmentation is not going to account for that concern, since they don't deal with typesetting
fantasai: So there's gonna be some aspects of how we're using Unicode codepoints with specific requirements that haven't come up in Unicode's context so far
fantasai: Unbreaking lines is something that's been hard to explain to them
myles: Maybe we shouldn't be unbreaking them?
fantasai: Too late for that!
fantasai: HTML has been unbreaking lines for as long as it has existed, we want to make that ability available to more languages
addison: fwiw I've had to write this code in the past, and it's not any fun
addison: It may have been individually solved but not written down
r12a: Like with the other issues, we need to look in more detail
r12a: the Tsek is a syllable separator, not the same as a word joiner
r12a: You could end a line with a Tsek, then start with more Tibetan on the next line, with indentation, and no real reason to join those together necessarily
fantasai: You wouldn't make the Tsek go away, just avoid the extra space going in there

ACTION: i18n to look at this issue of word separators next to newlines
ACTION: addison: ensure we respond to css 3481
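
The proposal discussed under issue #3481 can be sketched in Python as follows. This is a rough illustration under stated assumptions, not spec text: the separator set is just the three characters named in the discussion, the existing CJK and ZWSP rules are ignored, and the name `collapse_segment_breaks` is hypothetical.

```python
import re

# Assumption: the three characters named in the discussion, not a normative list.
WORD_SEPARATORS = frozenset({
    "\u1680",  # OGHAM SPACE MARK
    "\u1361",  # ETHIOPIC WORDSPACE
    "\u0f0b",  # TIBETAN MARK INTERSYLLABIC TSHEG
})

# A run of line breaks together with the spaces and tabs around it.
_SEGMENT_BREAK = re.compile(r"[ \t]*(?:\n[ \t]*)+")


def collapse_segment_breaks(text: str) -> str:
    """Unbreak source lines: a break becomes one space, or nothing
    when a word separator sits directly before or after it."""

    def replace(match: re.Match) -> str:
        before = text[match.start() - 1] if match.start() > 0 else ""
        after = text[match.end()] if match.end() < len(text) else ""
        if before in WORD_SEPARATORS or after in WORD_SEPARATORS:
            return ""   # the separator already divides the words
        return " "      # ordinary white space collapsing

    return _SEGMENT_BREAK.sub(replace, text)


latin = collapse_segment_breaks("hello\nworld")
ethiopic = collapse_segment_breaks("\u1230\u120b\u121d\u1361\n\u12d3\u1208\u121d")
```

In this sketch the separator itself is kept; only the extra space is avoided, which is the behavior described for the Tsek above.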
Received on Tuesday, 3 December 2019 23:15:19 UTC