Re: [w3c/uievents] Clarify `keypress` event handling for keys that map to non-BMP Unicode symbols (Issue #346) from drwez on 2023-07-11 (public-webapps-github@w3.org from July 2023)

From: drwez <notifications@github.com>
Date: Tue, 11 Jul 2023 04:40:18 -0700
To: w3c/uievents <uievents@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <w3c/uievents/issues/346/1630667594@github.com>
> One problem here is, browsers need to keep storing the last surrogate pair if .key of keyup needs to be set to the surrogate pair. I don't know whether there is an API to get last unicode point which was introduced by the preceding WM_KEYDOWN, but I guess there is no such API. (Similar issue occurs for .key of keyup in a dead key sequence.)

Browser implementations under Windows could certainly attempt to "collect" the first UTF-16 surrogate rather than propagating it, and then only emit an actual `keypress` if/when the second surrogate `WM_CHAR` is received - that would be conceptually similar to the dead-key handling logic.
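As a rough illustration of that "collect" approach, here is a minimal sketch in JavaScript. It is not how any browser actually implements this; `createSurrogateCollector` and `emit` are hypothetical names, and the input is assumed to be the sequence of UTF-16 code units that successive `WM_CHAR` messages would carry.

```javascript
// Hypothetical helper: holds back a high (first) surrogate until the
// matching low (second) surrogate arrives, then emits one whole code point.
function createSurrogateCollector(emit) {
  let pendingHigh = null;
  return function onCodeUnit(unit) {
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const isLow = unit >= 0xdc00 && unit <= 0xdfff;
    if (isHigh) {
      // Collect the first half instead of propagating it.
      pendingHigh = unit;
      return;
    }
    if (isLow && pendingHigh !== null) {
      // Combine the pair into a single code point and emit once.
      const codePoint =
        0x10000 + ((pendingHigh - 0xd800) << 10) + (unit - 0xdc00);
      pendingHigh = null;
      emit(codePoint);
      return;
    }
    // Any other unit discards a stale high surrogate.
    pendingHigh = null;
    // BMP characters pass straight through; a lone low surrogate is
    // simply dropped in this sketch.
    if (!isLow) emit(unit);
  };
}

const emitted = [];
const onCodeUnit = createSurrogateCollector((cp) => emitted.push(cp));
// The Adlam letter U+1E900 arrives as 0xD83A then 0xDD00, followed by "a".
[0xd83a, 0xdd00, 0x61].forEach(onCodeUnit);
// emitted is now [0x1e900, 0x61]: one entry per character typed.
```

What to do with an unpaired surrogate is a policy choice; dropping it here just keeps the sketch short.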

> Is it known that IE and EdgeHTML use the same system API surface as Gecko and Blink? Notably, https://learn.microsoft.com/en-us/windows/win32/inputdev/wm-unichar seems to exist.

As per the documentation you linked, `WM_UNICHAR` is provided only as a convenience for use by applications to inject Unicode character input without having to decompose it to UTF-16 code-units.  While the default message-handler will decompose it into a pair of `WM_CHAR` messages, for applications that don't handle it explicitly, it's not a message that the system itself ever sends.
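For reference, the decomposition of a code point into UTF-16 code units is just the standard surrogate-pair arithmetic. The sketch below illustrates that arithmetic only; it is not the Windows implementation, and `toUtf16CodeUnits` is a hypothetical name.

```javascript
// Hypothetical helper: splits a Unicode code point into the UTF-16 code
// units that would each be delivered in a separate WM_CHAR message.
function toUtf16CodeUnits(codePoint) {
  if (codePoint < 0x10000) {
    // BMP code points fit in a single code unit.
    return [codePoint];
  }
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10); // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const units = toUtf16CodeUnits(0x1e900); // an Adlam letter
// units is [0xd83a, 0xdd00]
```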

> > The keypress behaviour shown for IE and EdgeHTML don't really make sense, since the charCode field has a bogus value
>
> Indeed the charCode part doesn't make sense. However, the key field is consistent with Safari: https://hsivonen.fi/screen/safari-adlam.png . This is a pretty strong indication that it's Web-compatible to emit one sequence of keyboard events per Unicode Scalar Value and to represent the Unicode Scalar Value as two UTF-16 code units in the key field.

Sadly, not really - non-BMP keyboard input is still incredibly rare, so it seems plausible that folks simply aren't yet noticing that this case is broken in their implementations.

> > That the two implementations differ in their choice of charCode value suggests that the behaviours were artefacts of an implementation choice, rather than a conscious decision.

> Yes, but the bogus values suggest that it's not that likely for the Web to be relying on charCode, which means it's quite possible that it would be feasible for other engines to align to Safari's behavior, which (absent Web compat constraints to the contrary) is clearly the best behavior (no unpaired surrogates, charCode integer shows the same scalar value as the key string).

See above; non-BMP is still so rare that I suspect we're just not (yet) seeing folks impacted by the brokenness of `charCode` in some implementations.

> The Chrome Mac behavior (https://hsivonen.fi/screen/chrome-mac-adlam.png) also suggests that it should be Web-compatible to align to the Safari behavior.

Chrome Mac isn't emitting `keypress` at all in that example, so I don't think it's relevant to the question?

> > From my perspective, the primary problem here is when 2 separate event sequences are sent when the user enters a single character.

> I think the primary problem with splitting non-BMP characters across events is that (as far as I know) this is the only case where the environment that JS/Wasm runs in introduces unpaired surrogates. In every other case, environment-supplied DOMStrings are actually well-formed UTF-16 and the only way for a site-supplied program to get an unpaired surrogate in a string returned by a browser API is to first offer an unpaired surrogate as input to a browser API.

I think Gary was referring to the fact that Firefox emits two separate `input` events, for the two surrogates (as does Chrome on Windows).

Firefox and Chrome Windows are consistent with historical behaviour of `keypress` in this regard - the main issue that they have is that they're then continuing on to emit two distinct `input` events, which goes against the spec but happens to "work", for the most part.  I think we're all in agreement that the browsers should fix that. :)

Since the spec for `keypress` is not a specification but rather historical documentation, we're constrained, I think, to documenting the set of behaviours that content might need to contend with, which currently includes:
1. Two `keypress` each holding one surrogate code-unit in `charCode`.
2. One `keypress` holding a whole Unicode code-point in `charCode`.
3. No `keypress` at all.
4. One `keypress` holding only the first surrogate of the pair in `charCode`.

but clearly some of these behaviours are more reasonable/helpful than others. :)
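To make the four cases concrete, here is a minimal sketch of how content might tell them apart. It assumes the page has already gathered the `charCode` values of the `keypress` events seen for one typed non-BMP character; `classifyKeypressCharCodes` and its return strings are hypothetical.

```javascript
// Hypothetical classifier for the charCode values observed while the user
// typed a single non-BMP character.
function classifyKeypressCharCodes(charCodes) {
  const isHigh = (c) => c >= 0xd800 && c <= 0xdbff;
  const isLow = (c) => c >= 0xdc00 && c <= 0xdfff;
  if (charCodes.length === 0) {
    return "no keypress";
  }
  if (charCodes.length === 2 && isHigh(charCodes[0]) && isLow(charCodes[1])) {
    return "two keypress, one surrogate each";
  }
  if (charCodes.length === 1 && charCodes[0] > 0xffff) {
    return "one keypress, whole code point";
  }
  if (charCodes.length === 1 && isHigh(charCodes[0])) {
    return "one keypress, first surrogate only";
  }
  return "unrecognised";
}

const observed = classifyKeypressCharCodes([0xd83a, 0xdd00]);
// observed is "two keypress, one surrogate each"
```

Only the first two cases let content recover the character from `charCode` alone; in the other two it would have to fall back on something else, such as the `input` event.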

> Notably, Chrome on Windows treats Adlam, which is an actual keyboard layout, as an IME even though it treats the emoji touch keyboard as a keyboard!

That's an interesting observation! Both behaviours seem technically valid, though the Firefox behaviour seems more useful. I wonder what the difference there is.

> Considering that Chrome on Windows doesn't even appear to treat non-BMP keyboard layouts as keyboard layouts (even though IE, EdgeHTML, and Firefox treat them as keyboard layouts), I have a really hard time believing that the Web Platform couldn't converge on the combination of Safari and Windows 10 touch keyboard behaviors:

The Web Platform has converged on behaviours for `keydown`, `input` and `keyup` (though some implementations are buggy particularly with regard to `input`, as we've discussed).


`keypress` is a legacy event maintained for compatibility with older sites & frameworks, though - as Gary said:

> Note that that entire section is non-normative. We do not intend to normatively specify keypress or the deprecated keyCode and keyChar attributes, although we can certainly add implementation notes.

So the spec can document reasonable behaviour in the hope that new implementations will adopt it, and even that existing implementations will converge where feasible without breaking compatibility too much, but the situation differs from that of the normative specifications.  As a concrete example, if Chromium were to migrate `charCode` to hold the whole Unicode code-point then that would break sites that use `String.fromCharCode()` to process the field; they'd need updating to use `String.fromCodePoint()` to remain compatible.
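To illustrate the breakage: `String.fromCharCode()` truncates its argument to 16 bits, so a whole non-BMP code point silently turns into a different character, whereas the standard `String.fromCodePoint()` produces the expected surrogate pair. A small sketch, using U+1E900 as an assumed example value for `charCode`:

```javascript
// Assume a keypress whose charCode holds the whole code point U+1E900.
const charCode = 0x1e900;

// Legacy handling: the value is truncated to 16 bits (0xE900), which is a
// private-use character rather than the Adlam letter that was typed.
const viaFromCharCode = String.fromCharCode(charCode);

// Code-point-aware handling: yields the surrogate pair 0xD83A 0xDD00.
const viaFromCodePoint = String.fromCodePoint(charCode);

console.log(viaFromCharCode === "\ue900"); // true
console.log(viaFromCodePoint === "\ud83a\udd00"); // true
console.log(viaFromCodePoint.length); // 2
```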


-- 
Reply to this email directly or view it on GitHub:
https://github.com/w3c/uievents/issues/346#issuecomment-1630667594
Received on Tuesday, 11 July 2023 11:40:23 UTC