[CSSWG] Minutes Fukuoka F2F 2019-09-17 Part IV: Joint Meeting with Internationalization Group [css-selectors] [css-text] from Dael Jackson on 2019-12-03 (www-style@w3.org from December 2019)

From: Dael Jackson <daelcss@gmail.com>
Date: Tue, 3 Dec 2019 18:14:20 -0500
To: www-style@w3.org
Cc: www-international@w3.org
Message-ID: <CADhPm3uYMEMtVFAuZGdPTrePeYj0AB8TVHN+1ASH0RB6frVQ2A@mail.gmail.com>
=========================================
  These are the official CSSWG minutes.
  Unless you're correcting the minutes,
 Please respond by starting a new thread
   with an appropriate subject line.
=========================================


Joint Meeting with Internationalization
+++++++++++++++++++++++++++++++++++++++

Selectors
---------

  - There is complex interaction between the different historic sets
      of :lang() tags and subtags which makes resolving issue #4154
      (Canonicalization of :lang() selectors) complex. The PR doesn't
      capture all the complexity so florian will work with addison to
      see if a safe subset can be defined. florian will also look at
      the impacts of his proposal on various languages to ensure it's
      safe.

CSS Text
--------

  - RESOLVED: The presence of soft break opportunities between spans
              which change soft breaking rules is undefined
              (Issue #3897: Breaking Rules at inline element
              boundaries)
  - The i18n group will look further into issue #3481 (Remove
      collapsible line breaks adjacent to word separators) especially
      around cases such as the Ogham space mark.

===== FULL MINUTES BELOW =======

Agenda: https://wiki.csswg.org/planning/tpac-2019#agenda

Scribe: heycam

Joint Meeting with Internationalization
+++++++++++++++++++++++++++++++++++++++

Selectors
=========

Canonicalization of :lang() selectors
-------------------------------------
  github: https://github.com/w3c/csswg-drafts/issues/4154

  florian: The :lang selector lets you select pieces of the DOM for
           styling based on the language
  florian: It's already somewhat smart, since lang tags are structured
  florian: Selecting zh, and the document saying zh-Hant, it will do
           the right thing and match it
  florian: that logic is already built in
  florian: The IANA maintains a registry of the languages that exist
           and what they mean
  florian: tags and subtags
  florian: and in addition to just listing them, there is logic in
           that registry. Some languages are a deprecated version of
           some other languages
  florian: Cantonese used to be zh-yue. That is deprecated and
           replaced with yue
  florian: The lang selector does not take that logic into account
  florian: So if you have a document marked as lang="yue", and you are
           matching :lang(zh) or :lang(zh-yue), it won't match
  florian: We may want to use the registry definitions of how to match
  florian: I propose we do that
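
  The matching behavior florian describes can be illustrated with a
  minimal sketch. It assumes plain prefix matching on hyphen-separated
  subtags with no registry lookup; the function name lang_matches is
  hypothetical and is not taken from any specification or browser.

```python
def lang_matches(selector_range: str, element_lang: str) -> bool:
    """Prefix match: the element's tag equals the range, or starts
    with the range followed by a hyphen. No registry data is used."""
    rng = selector_range.lower()
    tag = element_lang.lower()
    return tag == rng or tag.startswith(rng + "-")


print(lang_matches("zh", "zh-Hant"))   # True: zh is a prefix of zh-Hant
print(lang_matches("zh", "yue"))       # False: no shared prefix
print(lang_matches("zh-yue", "yue"))   # False: deprecated form is not mapped
```

  Under this assumption a document tagged lang="yue" is missed by both
  :lang(zh) and :lang(zh-yue), which is the gap the proposal addresses.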

  addison: Some tag canonicalization is defined by BCP 47 to consume
           some of the information in the registry
  addison: You've been corresponding on the IETF languages list and I
           think some of your questions have been about handling
           macro-languages -- zh-yue is a macro language
  florian: zh-yue is a macro language, zh is a macro language

  addison: There's a separate thing. Previous to the current BCP 47,
           there was a mechanism for registering whole tags
  addison: that's grandfathered now
  addison: Some of them match subtags, some don't
  addison: [...] is replaced by xtg
  addison: Ignoring grandfathered tags, they all map to something. The
           ones you're referring to are structurally identical, the
           tags are composed of subtags
  addison: like zh-yue
  florian: The way I'm looking at this, there are a variety of reasons
           for why certain languages might be the same
  florian: there is a defined canonicalization that handles some of
           them

  addison: For the BCP 47 canonicalization, that will do away with the
           grandfathered ones and other structural weirdness
  florian: It won't deal with the two types of Norwegian
  florian: This is a complicated topic with many weird variants
  addison: There's a subset there that's well defined
  addison: There's a second set of rules, which are in CLDR
  addison: UTR 35
  addison: for handling some additional cases around Chinese, where
           you have different script subtags that you want to appear
           or not in some circumstances
  addison: some of those may be of interest, but it's more complicated
  addison: I don't want to pretend those don't exist; they do
  florian: If you have a link, please drop it
  addison: Defining matching, if you're just using BCP 47 "lookup" IINM
  florian: Extended filtering
  florian: the text for extended filtering says you should canonicalize
  addison: Yes you should
  florian: Thanks for bringing up that the topic is broader
  addison: If you do the minimum set, it'll make it the most
           predictable. the other aspects are worth studying
  addison: there are some annoying corner cases in Chinese

  florian: I hear support for the current proposal, and complicated
           problems to think about in addition to that
  addison: Yes I agree with your current proposal and then do further
           study, and track the other standards happening in that space
  florian: There is a PR for this
  addison: Should we review that?
  <fantasai> https://github.com/frivoal/csswg-drafts/commit/3cff5d844b6415ef30d3e2dac221f9479e0ec7aa
  florian: If you haven't I suggest you do

  AmeliaBR: The other question on the topic, do we have implementor
            commitments?
  r12a: The current text I'm looking at says "... must be converted to
        x-lang form"
  r12a: that's a slightly different discussion from what you
        canonicalize it as
  r12a: zh-yue would become yue
  florian: I had that discussion on the list as well
  florian: This is the right direction
  florian: zh doesn't match yue so if you canonicalize both to x-lang
           format, it'll match
  florian: I raised this on the mailing list, and they agreed it was
           the right form to canonicalize it to
  addison: Some people on the list did
  addison: The challenge is that this will bring you more promiscuous
           matching than the author may have intended
  addison: It'll make Cantonese match Mandarin Chinese in some cases
  florian: If you want to match Mandarin specifically that's also
           possible
  addison: Normally Mandarin is tagged just as zh
  r12a: For all the macro languages there's usually a preferred
        language
  fantasai: If the author cares that much, they can put the
            information there
  addison: That's right
  addison: you don't want to have them with a correctly tagged
           document, have the :lang match things they were [...]
  <addison> http://www.unicode.org/reports/tr35/#Canonical_Unicode_Locale_Identifiers

  duerst: That mailing list is no longer a WG
  duerst: so people can give you opinions and background knowledge,
          but no formal resolutions

  <AmeliaBR> So, two cases: (A) author used zh in stylesheet and yue in
             HTML; doesn't expect a match. (B) author used zh in
             stylesheet and zh-yue in HTML; does expect a match.
             Canonicalizing both yue and zh-yue to the same value will
             break one or the other.
  florian: I agree that the problem can exist in both directions, too
           much or not enough, I think since we're doing it for
           typographical purposes, and the languages are related, most
           of the time if you have zh styles you want it to match
           Cantonese too
  florian: It's possible to style Mandarin differently from Cantonese,
           Hakka, etc., but that's rare
  <addison> http://www.unicode.org/reports/tr35/#Likely_Subtags
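
  AmeliaBR's cases (A) and (B) can be checked mechanically. The sketch
  below assumes plain prefix matching and two hypothetical
  canonicalization directions for the yue / zh-yue pair; all names are
  illustrative only.

```python
def prefix_match(rng: str, tag: str) -> bool:
    """Prefix match on hyphen-separated subtags."""
    return tag == rng or tag.startswith(rng + "-")


def no_canon(tag: str) -> str:
    return tag


def to_bare(tag: str) -> str:
    # Hypothetical direction 1: zh-yue becomes yue.
    return "yue" if tag == "zh-yue" else tag


def to_extlang(tag: str) -> str:
    # Hypothetical direction 2: yue becomes zh-yue.
    return "zh-yue" if tag == "yue" else tag


# name -> (stylesheet range, document tag, whether the author expects a match)
CASES = {"A": ("zh", "yue", False), "B": ("zh", "zh-yue", True)}


def outcomes(canon) -> dict:
    """For each case, report whether the author's expectation is met."""
    return {
        name: prefix_match(canon(rng), canon(tag)) == expected
        for name, (rng, tag, expected) in CASES.items()
    }


print(outcomes(no_canon))    # {'A': True, 'B': True}
print(outcomes(to_bare))     # {'A': True, 'B': False}
print(outcomes(to_extlang))  # {'A': False, 'B': True}
```

  With these two cases, each canonicalization direction breaks exactly
  one expectation, while leaving tags untouched satisfies both but
  still misses the yue versus zh-yue pairing raised earlier.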

  r12a: It's not just Chinese we're talking about
  r12a: There are other languages that have much more differentiation
        between the language depending on which of the subtags you
        choose
  r12a: The point I wanted to make was that we said that let's go
        ahead with the proposal at the moment
  r12a: Looking at the issue, there was a proposal you wrote, I
        responded saying you had to modify that
  r12a: the PR doesn't say much
  r12a: not sure what the exact proposal is
  r12a: I think this information we're talking about now should also
        be part of that
  florian: The earlier proposal that you rightfully pointed out I
           wrote too much, including making zh-HK match yue and things
           like this, that's not defined in the repo I'm referring to
  florian: I'm just saying, just the canonicalization to x-lang form
           as defined by BCP 47
  florian: and as supported by the mailing list that used to be the WG
           that used to define that document
  florian: but whichever way we go, including no change at all, has a
           risk of mismatching things in some cases
  addison: Not all tags match all values, otherwise what's the point
  addison: The problem is to arrive at something that authors
           understand how to get the results they want
  addison: we'll make some compromises, the question is which ones

  fantasai: Based on the conversation so far, it seems like I don't
            think canonicalizing yue to zh-yue is going to be good.
            Either we don't canonicalize, or in a direction where zh
            encompasses Cantonese
  fantasai: I am sure there are style sheets that just use :lang(zh),
            and they'll expect it to match
  addison: The other possibility is that the inclusion or
           non-inclusion of the enclosing subtag -- in this case zh --
           is a choice the author is making deliberately. if they've
           made that choice deliberately, if we mess with their tags
           when doing matching it may produce results they don't expect
  addison: Most of the matching algorithms are strict "remove from
           right" subtag matching
  addison: to make it obvious what's happening
  addison: Once you start adding or subtracting subtags in ways
           other than the deprecation/renaming, I think that has more
           risk to it in your space
  addison: since it's not obvious what's going to happen
  addison: I would support doing the mappings that are in the registry,
           since that's where if you have multiple variations, because
           people have older documents and style sheets, they'll get
           the right answer
  addison: That's different than adding or subtracting subtags
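
  A rough sketch of the distinction addison draws: apply only the
  deprecation/renaming mappings, then do strict "remove from right"
  matching. The two tables hold a handful of entries from the IANA
  registry for illustration and are far from complete; the function
  names are hypothetical, and real BCP 47 lookup has extra rules (such
  as handling of single-character subtags) that are omitted here.

```python
# Illustrative subset only; a real implementation would load the registry.
DEPRECATED_LANGUAGE = {"iw": "he", "ji": "yi", "in": "id"}
GRANDFATHERED = {"i-klingon": "tlh", "art-lojban": "jbo"}


def canonicalize(tag: str) -> str:
    """Map whole grandfathered tags and deprecated primary subtags.
    No subtags are added or removed otherwise."""
    lowered = tag.lower()
    if lowered in GRANDFATHERED:
        return GRANDFATHERED[lowered]
    subtags = lowered.split("-")
    subtags[0] = DEPRECATED_LANGUAGE.get(subtags[0], subtags[0])
    return "-".join(subtags)


def truncations(tag: str):
    """Yield the tag, then the tag with subtags removed from the right."""
    subtags = tag.split("-")
    while subtags:
        yield "-".join(subtags)
        subtags.pop()


def matches(rng: str, tag: str) -> bool:
    return canonicalize(rng) in truncations(canonicalize(tag))


print(matches("he", "iw-IL"))       # True: iw is renamed to he first
print(matches("tlh", "i-klingon"))  # True: grandfathered tag is mapped
print(matches("zh", "yue"))         # False: no subtag is added back
```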

  AmeliaBR: We covered a lot of what I was going to say, but with a
            different conclusion
  AmeliaBR: It's important that when matching a style sheet and a
            document that we respect the way that the author matched
            it, don't want to introduce spurious matching from
            canonicalization
  AmeliaBR: also don't want to break matching
  AmeliaBR: From the examples brought up, it's obvious that any
            canonicalization may end up breaking one site or the other
  AmeliaBR: The question is then how do we make it easier in the
            general case for having new style sheets or new UA style
            rules deal with all these deprecated synonyms
  AmeliaBR: At the UA style sheet, that can just be advice to UAs
            to look up the BCP deprecation list
  AmeliaBR: then also include the deprecated synonyms
  AmeliaBR: That doesn't work for things like a style sheet that is
            coming from a library or CSS reset
  AmeliaBR: or the case of newer code, writing a new style sheet,
            but still apply to the old pages with the older language
            tags
  AmeliaBR: One approach that might address that use case is something
            like what we do with case insensitive selector matching
  AmeliaBR: a flag in the selector that means "this value or any
            synonyms"
  florian: So an opt in for canonicalization
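
  The opt-in idea could behave roughly as sketched below. No such
  selector flag exists; the synonyms parameter, the synonym groups, and
  the function names are all assumptions made for illustration.

```python
# Hypothetical synonym groups; a real list would come from the registry.
SYNONYM_GROUPS = [{"yue", "zh-yue"}, {"he", "iw"}]


def expand(rng: str) -> set:
    """Return the range together with any listed synonyms."""
    for group in SYNONYM_GROUPS:
        if rng in group:
            return set(group)
    return {rng}


def lang_matches(rng: str, tag: str, synonyms: bool = False) -> bool:
    """Prefix match; with synonyms=True the range also stands for its
    listed synonyms ("this value or any synonyms")."""
    ranges = expand(rng.lower()) if synonyms else {rng.lower()}
    lowered = tag.lower()
    return any(lowered == r or lowered.startswith(r + "-") for r in ranges)


print(lang_matches("zh-yue", "yue"))                 # False: default behavior
print(lang_matches("zh-yue", "yue", synonyms=True))  # True: author opted in
print(lang_matches("zh", "yue", synonyms=True))      # False: zh is not broadened
```

  In this sketch, existing style sheets keep their current behavior
  because the broader matching only applies when the author asks for
  it.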

  addison: There are three sets
  addison: the grandfathered list is permanently fixed and has been
           for 10 years
  addison: all those tags have explicit mappings, you can safely map
           them to modern equivalents or vv
  addison: Individual subtags that have mappings, it's mostly about
           countries going out of business
  addison: yiddish has two subtags, hebrew has two subtags, there's a
           canonical one
  addison: The third thing is the x-lang thing, which is inconvenient
  addison: because there's two ways to say things. With or without the
           enclosing subtag
  addison: The canonicalization rule in BCP 47 says you can drop the
           primary language subtag and use the x-lang by itself
  addison: it's permissible for implementations to do that
  addison: I don't recall it says you can put it back
  florian: There are 2 sets of rules
  florian: one that just strips it off. The other says when you're
           done stripping it off, put it back
  r12a: It says you could consider doing that
  addison: The first two are completely safe
  r12a: You want to do those
  r12a: for interop
  r12a: The x-lang thing, I think you can choose
  r12a: whether to put the enclosing subtag on
  r12a: The challenge is that Chinese you'd want to do that, but some
        of the other macro languages are not as crisp. Arabic is one
        of these, Malaysian
  <r12a> https://r12a.github.io/app-subtags/
  r12a: Omani Arabic and Moroccan Arabic, which treat certain things
        differently, may have different font requirements
  r12a: but they both resolve to "ar" if we follow this PR
  r12a: but that's used for standard Arabic
  <myles> thanks for the link, r12a is the best
  <fantasai> +1

  florian: I think we're not ready to merge the PR
  florian: Action items: the safe subset of canonicalization, I don't
           think it's defined as a canonicalizing operation separately
           from the x-lang thing
  florian: Action on me to find out if we can
  addison: This is an area that probably deserves better documentation
           from us
  addison: We can go offline and make sure we get the right answer
  addison: We can go back and talk to the locale folks at Unicode and
           the languages list and make sure we're capturing the sense
           of this
  florian: One, figure out if the safe subset exists as a standard
           operation
  florian: Two, if we do what I'm proposing, look at the affected
           languages and see if it's good for them

CSS Text 3
==========

Breaking Rules at inline element boundaries
-------------------------------------------
  github: https://github.com/w3c/csswg-drafts/issues/3897

  fantasai: There was an issue raised about what happens when you have
            two inline elements that have different breaking rules
  fantasai: 3 properties control this. white-space, word-break,
            line-break
  fantasai: Looking at an example (in the GitHub issue)
  fantasai: at the boundaries of the span, which line breaking rules
            applies when it has a different word-break prop value to
            the rest of the text
  fantasai: for white-space, the nearest common ancestor is used
  fantasai: The complication for word-break and line-break is that the
            determining rules for where you're allowed to break
            requires running an analysis on a lot of text
  fantasai: and every time that value changes you have to do another
            run, so impl wise it's a bit awkward
  fantasai: There's been some discussion about what's the best
            behavior here
  fantasai: I wanted to ask i18n if you have feedback on this issue
  fantasai: and ask the WG if this proposal to leave this undefined
            for L3, give impl time to experiment
  fantasai: Doesn't seem to be a terribly high importance case to
            solve at the moment

  florian: I think one of the more interesting cases -- and I support
           making it undefined -- is if the parent div allows a break
           between every letter, and two spans next to each other
           which don't
  florian: Can you break between the spans or not
  florian: Current spec says yes, but it's hard-ish to implement
  florian: If we need time to think about this, undefine it for a
           while, seems reasonable
  florian: but that's the kind of case this brings to the surface

  Rossen: Any objections to leaving it undefined?
  r12a: we should look at it as a group offline
  r12a: It's quite a long thread. I seem to remember someone brought
        up an example that didn't work
  nmccully: Are there layout engines working on this that would
            benefit from the extra time?
  fantasai: Part of the issue is the ICU APIs make it awkward
            for the rules to change in the middle of the line
  fantasai: so impl wise it's awkward
  fantasai: Could be factorial if you're changing it every other
            letter in the line
  fantasai: so there's some hesitancy to impl that given the current
            infrastructure
  fantasai: but there don't seem to be great solutions
  fantasai: Some of the behaviors you'd get from doing an easy thing
            would be non-symmetric
  fantasai: you'd be switching slightly less if you use the current
            rule in the spec, but that's all
  fantasai: There's not a high pressure to solve this and get interop
  fantasai: Look at it again in L4

  myles: Tangential comment, the general thing we're discussing is
         styling element boundaries
  myles: This is something letter-spacing also does
  myles: The spec says something that all browsers disagree with
  myles: If we do come up with a good way to describe boundary
         behavior, we should try to use this system to describe
         letter-spacing too
  fantasai: I think the spec is right on letter-spacing
  nigel: I think it would be good to have a general way to handle this
  florian: we have a current generalized rule, that is general, and
           does the right thing, and is painful to impl

  RESOLVED: It's undefined

  <myles> the presence of soft break opportunities between spans which
          change soft breaking opportunities is undefined heycam

  RESOLVED: i.e., the presence of soft break opportunities between
            spans which change soft breaking opportunities is undefined

Remove collapsible line breaks adjacent to word separators
----------------------------------------------------------
  github: https://github.com/w3c/csswg-drafts/issues/3481

  fantasai: We generally have this concept in CSS and HTML that you
            can use white space to format your source, and we collapse
            white space, including line breaks, down to a single space
  fantasai: essentially unbreaking the source lines to create a
            paragraph
  fantasai: For Chinese and Japanese which don't use spaces, we have
            some rules to remove the space; otherwise you will be
            forced to put entire paragraphs on one line always
  fantasai: There are some rules for doing that based on character
            classes
  fantasai: What we didn't consider thoroughly is languages that use a
            word separator that's not a space
  fantasai: We do special case ZWSP, for Thai and other languages
  fantasai: but we don't have something similar for Ethiopic word space
  fantasai: Probably don't also want a regular space added there
  fantasai: Proposal is when there's a word separator character
            adjacent to a line break, the line break just goes away
  fantasai: I think the characters that are affected here are Ogham
            space mark and Ethiopic word space and the Tibetan tsek
  <koji> https://drafts.csswg.org/css-text-3/#word-separator
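
  The proposal can be sketched as a small text transform. The sketch
  assumes the three characters fantasai names (Ethiopic wordspace
  U+1361, Ogham space mark U+1680, Tibetan tsek U+0F0B) are the whole
  list, which the discussion leaves open, and it does not implement
  the separate Chinese/Japanese rules. The name unbreak is
  hypothetical.

```python
import re

# Assumed list: Ethiopic wordspace, Ogham space mark, Tibetan tsek.
WORD_SEPARATORS = "\u1361\u1680\u0f0b"


def unbreak(source: str) -> str:
    """Collapse each source line break (with surrounding spaces/tabs).
    Next to a word separator the break is dropped; otherwise it becomes
    a single space. The separator itself is always kept."""
    def replace(match):
        before = source[match.start() - 1] if match.start() > 0 else ""
        after = source[match.end()] if match.end() < len(source) else ""
        near_separator = (
            (before != "" and before in WORD_SEPARATORS)
            or (after != "" and after in WORD_SEPARATORS)
        )
        return "" if near_separator else " "

    return re.sub(r"[ \t]*(?:\n[ \t]*)+", replace, source)


print(repr(unbreak("foo\n  bar")))       # 'foo bar'
print(repr(unbreak("a\u1361\n  b")))     # separator kept, no space added
```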

  AmeliaBR: Does this map to something in Unicode? or do we need to
            maintain this list?
  r12a: I think there is something, not sure if it's fit for this
        purpose
  r12a: archaic scripts have other examples
  fantasai: [reads definition in the spec right now for word-spacing]
  florian: We need to maintain a list
  myles: Let's ask Unicode to do it
  myles: If there is such a facility for these character lists, hard
         to believe it's specific for the web platform
  myles: and not needed in text editors for example
  myles: I don't think the web specs should maintain this list
  florian: I agree with part of your statement, should try to work
           this out with Unicode
  florian: This one specifically maybe, but some are specific to the
           web platform
  florian: since this is relevant to turning HTML markup into text
  myles: There are many different markup languages...

  fantasai: There are 2 questions
  fantasai: if we want to do this, and then whether we maintain the
            list or if Unicode should
  addison: I think we want to do some research
  addison: space or no space is a classic problem
  addison: I would be surprised if there weren't something, but don't
           know off the top of my head
  addison: would be happy to engage
  myles: If this is a classical problem, it's been solved, and we
         should figure out how it's been solved in the past and re-use
         that solution

  fantasai: looking at some of the stuff in css-text, we have a
            concept of word separators
  fantasai: and it includes a set of code points
  fantasai: It excludes Ogham space mark
  fantasai: since it would cause text to not join any more
  [word-spacing has different considerations than white space
     collapsing]

  fantasai: So general usage in Unicode is text processing
            segmentation is not going to account for that concern,
            since they don't deal with typesetting
  fantasai: So there's gonna be some aspects of how we're using
            Unicode codepoints with specific requirements that haven't
            come up in Unicode's context so far
  fantasai: Unbreaking lines is something that's been hard to explain
            to them
  myles: Maybe we shouldn't be unbreaking them?
  fantasai: Too late for that!
  fantasai: HTML has been unbreaking lines for as long as it has
            existed, we want to make that ability available to more
            languages

  addison: fwiw I've had to write this code in the past, and it's not
           any fun
  addison: It may have been individually solved but not written down
  r12a: Like with the other issues, we need to look in more detail
  r12a: the Tsek is a syllable separator, not the same as a word joiner
  r12a: You could end a line with a Tsek, then start with more Tibetan
        on the next line, with indentation, and no real reason to join
        those together necessarily
  fantasai: You wouldn't make the Tsek go away, just avoid the extra
            space going in there

  ACTION: i18n to look into this issue of word separators next to newlines

  ACTION: addison: ensure we respond to css 3481
Received on Tuesday, 3 December 2019 23:15:19 UTC