[csswg-drafts] [css-selectors-4] Clarify :lang() behavior when the language range is not a well-formed BCP 47 code (#8720) from jfkthame via GitHub on 2023-04-14 (public-css-archive@w3.org from April 2023)

From: jfkthame via GitHub <sysbot+gh@w3.org>
Date: Fri, 14 Apr 2023 12:42:39 +0000
To: public-css-archive@w3.org
Message-ID: <issues.opened-1668180635-1681476157-sysbot+gh@w3.org>

jfkthame has just created a new issue for https://github.com/w3c/csswg-drafts:

== [css-selectors-4] Clarify :lang() behavior when the language range is not a well-formed BCP 47 code ==
According to https://www.w3.org/TR/selectors-4/#the-lang-pseudo,

> An element’s [content language](https://www.w3.org/TR/css-text-3/#content-language) matches a [language range](https://www.w3.org/TR/selectors-4/#language-range) if, when represented in BCP 47 syntax [[BCP47]](https://www.w3.org/TR/selectors-4/#biblio-bcp47), it matches that language range in an extended filtering operation per [[RFC4647]](https://www.w3.org/TR/selectors-4/#biblio-rfc4647) Matching of Language Tags (section 3.3.2).

The text also goes on to mention that

> The language range does not need to be a *valid language code* to perform this comparison.

(Emphasis mine.) As I understand it, this implies that something like `:lang("qq")` will match content tagged with `lang="qq"` even though `qq` is not a *valid* language tag (it does not appear in the IANA registry).
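
For illustration, here is a minimal Python sketch of the extended filtering steps from RFC 4647 section 3.3.2. The names `extended_filter` and `ascii_lower` are hypothetical, the ASCII-only case folding is an assumption of this sketch, and the special cases Selectors defines for `:lang("")` and untagged content are not covered. Nothing in the algorithm consults the IANA registry, so an unregistered code like `qq` matches itself.

```python
def ascii_lower(s: str) -> str:
    # Fold only A-Z, so non-ASCII characters are compared unchanged.
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def extended_filter(language_range: str, tag: str) -> bool:
    """Sketch of RFC 4647 section 3.3.2 extended filtering (no validity check)."""
    range_subtags = ascii_lower(language_range).split("-")
    tag_subtags = ascii_lower(tag).split("-")

    # Step 2: the first subtags must match; a leading "*" matches anything.
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False

    r, t = 1, 1
    # Step 3: walk the remaining subtags of the range.
    while r < len(range_subtags):
        if range_subtags[r] == "*":
            r += 1                      # 3A: skip wildcards in the range
        elif t >= len(tag_subtags):
            return False                # 3B: tag exhausted before the range
        elif range_subtags[r] == tag_subtags[t]:
            r += 1                      # 3C: subtags match, advance both
            t += 1
        elif len(tag_subtags[t]) == 1:
            return False                # 3D: cannot skip over a singleton
        else:
            t += 1                      # 3E: skip this tag subtag
    return True                         # Step 4: range exhausted, match


print(extended_filter("qq", "qq"))            # True
print(extended_filter("de-DE", "de-Latn-DE")) # True
print(extended_filter("de-DE", "de-x-DE"))    # False
```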

However, the Selectors spec does not specifically address how *ill-formed* (not merely *invalid*) tags should be handled.

According to the language tag syntax given in https://www.rfc-editor.org/rfc/rfc5646#section-2.1, a tag like `åå` (containing non-ASCII characters) would be ill-formed ("the language tags described in this document are sequences of characters from the US-ASCII [[ISO646](https://www.rfc-editor.org/rfc/rfc5646#ref-ISO646)] repertoire"), as would a tag like `en---` (the various subtags following the primary language subtag are *optional*, but the grammar does not allow for them to be *empty*; if they're not present, the corresponding hyphen delimiters should also be omitted).
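
To make the well-formed versus ill-formed distinction concrete, here is a rough Python sketch of a well-formedness check based on the `langtag` and `privateuse` productions in RFC 5646 section 2.1. It is a simplification: the grandfathered tags are deliberately omitted, and the names `LANGTAG` and `is_well_formed` are hypothetical. It checks syntax only, not whether the subtags are registered.

```python
import re

ALPHA = "[A-Za-z]"
ALNUM = "[A-Za-z0-9]"

# Simplified RFC 5646 grammar; grandfathered tags are intentionally left out.
LANGTAG = re.compile(
    rf"""
    (?:
        (?: {ALPHA}{{2,3}} (?:-{ALPHA}{{3}}){{0,3}} | {ALPHA}{{4,8}} )  # language, optional extlang
        (?: -{ALPHA}{{4}} )?                                            # script
        (?: -(?: {ALPHA}{{2}} | [0-9]{{3}} ) )?                         # region
        (?: -(?: {ALNUM}{{5,8}} | [0-9]{ALNUM}{{3}} ) )*                # variants
        (?: -[0-9A-WY-Za-wy-z] (?: -{ALNUM}{{2,8}} )+ )*                # extensions
        (?: -[xX] (?: -{ALNUM}{{1,8}} )+ )?                             # trailing private use
      |
        [xX] (?: -{ALNUM}{{1,8}} )+                                     # private-use-only tag
    )
    """,
    re.VERBOSE,
)


def is_well_formed(tag: str) -> bool:
    # fullmatch: the whole string must fit the grammar, with no leftovers.
    return LANGTAG.fullmatch(tag) is not None


print(is_well_formed("qq"))     # True: well-formed, though not registered
print(is_well_formed("åå"))     # False: non-ASCII characters
print(is_well_formed("en---"))  # False: empty subtags
```

The explicit `[A-Za-z]` classes are used instead of a case-insensitive flag so that only ASCII letters are accepted.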

So how does `:lang()` matching work in the presence of ill-formed codes? It seems to me that a literal reading of the spec requires that such codes *never* match, because its definition of "matches" depends on "when represented in BCP 47 syntax", and such ill-formed codes cannot be represented in BCP 47 at all; they conflict with its basic grammar.

A possible alternative interpretation might be that the handling of ill-formed codes is simply *undefined*, because the spec only addresses what it means to "match" for codes "represented in BCP 47 syntax".

I'm not aware of any compelling use case for ill-formed language codes. So in the interests of clarity and interoperability I would like to ask the WG to confirm (and explicitly note in the spec) that `:lang()` matching is based strictly on BCP 47 and RFC4647, and as such, *ill-formed codes never match*.

(Note that the current implementation in WebKit *does* allow ill-formed tags to match. Thus if content is tagged with `lang="SomeRandomCode-Latn-US"`, which is ill-formed because the primary language subtag is too long, it is nevertheless matched by `:lang(SomeRandomCode)`, `:lang("*-US")`, etc. I think this should be considered a bug in the implementation.)

Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/8720 using your GitHub account



Received on Friday, 14 April 2023 12:42:41 UTC