[csswg-drafts] [css-content][css-fonts][css-text] Language-dependent behavior in CSS with ill-formed language tags (#7098)

jfkthame has just created a new issue for https://github.com/w3c/csswg-drafts:

== [css-content][css-fonts][css-text] Language-dependent behavior in CSS with ill-formed language tags ==
This question is about CSS-related behaviors that depend on the content language, and how they respond when the content has an ill-formed `lang` attribute.

Examples of CSS features that are affected include [font resolution](https://drafts.csswg.org/css-fonts-4/#generic-font-families) (e.g. whether generic families like `sans-serif` used with CJK content resolve to Japanese or Chinese font faces should depend on the language), [auto-hyphenation](https://drafts.csswg.org/css-text-3/#valdef-hyphens-auto), and generation of [quote marks](https://drafts.csswg.org/css-content/#valdef-quotes-auto) around `<q>` elements.

[HTML](https://html.spec.whatwg.org/multipage/dom.html#attr-lang) says that the `lang` attribute must be a BCP 47 language tag, and [BCP 47](https://www.rfc-editor.org/rfc/rfc5646.html#section-2) says that these are composed of a sequence of subtags separated by hyphens.
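
To make the hyphen requirement concrete, here is a rough sketch of a purely structural check. It assumes only that subtags are runs of 1 to 8 ASCII letters or digits joined by hyphens; it is not the full RFC 5646 grammar, and the name `looksLikeLanguageTag` is made up for illustration.

```javascript
// Rough structural check only; NOT the full RFC 5646 ABNF.
// Assumption: each subtag is 1 to 8 ASCII letters or digits,
// and subtags are joined by single hyphens.
const ROUGH_TAG_SHAPE = /^[A-Za-z0-9]{1,8}(?:-[A-Za-z0-9]{1,8})*$/;

// Hypothetical helper name, used here for illustration only.
function looksLikeLanguageTag(tag) {
  return ROUGH_TAG_SHAPE.test(tag);
}

console.log(looksLikeLanguageTag("en-US")); // true
console.log(looksLikeLanguageTag("en_US")); // false: "_" is not a valid separator
```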

However, we've seen content [in the wild](https://www.gutenberg.org/ebooks/search/) where the `lang` attribute uses an underscore instead of a hyphen to separate subtags, as in `en_US` or `en_GB`. As I understand it, according to BCP 47, such a tag is ill-formed, but it's not entirely surprising that such errors show up, as the underscore separator is used in [POSIX locale codes](https://en.wikipedia.org/wiki/Locale_(computer_software)#POSIX_platforms), and in [other major software systems](https://unicode-org.github.io/icu/userguide/locale/#the-locale-concept).

**Question: should browsers pay any attention to such language tags, even though they are not correct BCP 47 tags?**

Some testing indicates that current behavior is a bit haphazard. I've created codepen testcases to see whether an ill-formed lang tag affects (1) [font resolution](https://codepen.io/jfkthame/pen/vYWvNVg), (2) [hyphenation](https://codepen.io/jfkthame/pen/QWOzjRZ), and (3) [quote marks](https://codepen.io/jfkthame/pen/abVPNOV), or is ignored.

Results:

(1) In WebKit and Blink, the "bad" lang tag affects font resolution. In Gecko, it doesn't; but I just [landed a patch](https://bugzilla.mozilla.org/show_bug.cgi?id=1757578) to change this behavior, so that upcoming Firefox Nightly will behave like WebKit and Blink browsers in this respect. (This was before I realized quite how messy the current situation is. We could revert it.)

(2) In WebKit and Blink *on macOS*, the "bad" lang tag affects hyphenation, but in Blink on Windows, it doesn't. In Gecko, it doesn't on any platform.

(3) No browser pays attention to the "bad" lang tag for the purpose of generating quote marks.

Furthermore, as far as I can tell no browser accepts such tags in JS: calling `new Intl.Locale("en_US")` throws an error in all browsers I tested.
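
For reference, the JS behavior can be observed with a snippet along these lines. The helper name `canonicalTagOrNull` is made up for illustration. Note also that `Intl.Locale` validates against the Unicode locale identifier syntax used by ECMA-402, which is close to BCP 47 but may not match it in every corner case.

```javascript
// Hypothetical helper: returns the canonical tag as a string,
// or null when Intl.Locale rejects the input as structurally invalid.
function canonicalTagOrNull(tag) {
  try {
    return new Intl.Locale(tag).toString();
  } catch (err) {
    // Structurally invalid tags are reported as a RangeError.
    if (err instanceof RangeError) return null;
    throw err;
  }
}

console.log(canonicalTagOrNull("en-US")); // "en-US"
console.log(canonicalTagOrNull("en_US")); // null
```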

So on the JS side, things seem clear enough: only valid BCP 47 is accepted, anything else throws an error. But on the HTML/CSS side, it's a mess. Currently, Gecko never respects invalid tags, while WebKit and Blink do respect them for font-resolution purposes. And for hyphenation control, Blink may or may not respect them, depending on the platform.

Can we get some better interop here? Ideally, I think we should agree (and perhaps clarify in a note somewhere) that *only* well-formed BCP 47 language tags will have any effect on the content-language-dependent CSS features, and the browsers that are currently accepting ill-formed tags should stop doing so.

Alternatively, we should agree on exactly which kinds of ill-formed tags *are* accepted, and record this in a spec so that we can all converge on compatible behavior. It makes no sense that `en_US` enables US English hyphenation in Chrome on macOS but not in Chrome on Windows; and it makes no sense that `de_AT` selects Austrian-German hyphenation but does *not* activate Austrian-German quote marks.
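
The two options can be sketched roughly as follows. Neither function describes specified or actual browser behavior. The names `strictLanguage` and `lenientLanguage` are hypothetical, and the underscore-to-hyphen rewrite is just one possible rule chosen for illustration. The point is only that whichever policy is picked, it would apply uniformly to fonts, hyphenation, and quotes.

```javascript
// Shared helper: canonical tag, or null if the input is rejected.
function canonicalize(tag) {
  try {
    const [canonical] = Intl.getCanonicalLocales(tag);
    return canonical ?? null;
  } catch {
    return null; // structurally invalid input
  }
}

// Option 1 (hypothetical): ill-formed tags have no effect at all.
function strictLanguage(lang) {
  return canonicalize(lang);
}

// Option 2 (hypothetical): one explicitly agreed repair, and nothing else.
// Assumption: underscores are treated as hyphens before validation.
function lenientLanguage(lang) {
  return canonicalize(lang.replace(/_/g, "-"));
}

console.log(strictLanguage("de_AT"));  // null
console.log(lenientLanguage("de_AT")); // "de-AT"
```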


Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/7098 using your GitHub account


-- 
Sent via github-notify-ml as configured in https://github.com/w3c/github-notify-ml-config

Received on Wednesday, 2 March 2022 16:58:28 UTC