Re: Actions questions

Just want to make sure we're looking at the same spec? I've got a draft
from April 8 (retrieved from ), but I remember an
older version that had a lot more example JSON.

Also, I wasn't at the meeting at Facebook London, so apologies if I've
missed out on any context. Comments below, inline.

On Wed, Apr 20, 2016 at 5:47 AM James Graham <> wrote:

> I am trying to understand the intent behind the actions section in the
> spec, so I can shore up the text a bit, and make it implementable. I
> have some of the notes from the meeting at Facebook London, but many
> outstanding questions, some of which are below.
> == Protocol ==
> I think the spec currently states that the top-level message structure
> is an object, like
> {"actions": []}
> Other diagrams indicate a list directly at the top level. Which should
> we choose? I feel like the former is closer in structure to the other
> commands in the spec, which all pass parameters as an object. It also
> provides some measure of future-proofing as we can extend the top level
> of the parameters in the future without ugly hacks.


> Inside this top-level list, the spec suggests further lists, but I have
> a diagram here that suggests an object:
> {"type": "key",
>   "id": "1",
>   "actions": [{"name": "keyDown",
>                "code": "a"},
>               {"name": "keyUp",
>                "code": "a"}
>              ]
> }
> Compared to what's in the spec, this allows an id attribute, which is
> needed for e.g. multi-touch. But it's more constrained in that each
> sequence of actions can only refer to a specific device, so if I wanted
> to have some pointer actions and some key actions, in a sequence, I
> would have to send multiple chains, padding with pause actions. Does it
> make sense to put the type and id on each action entry, like:
> [{"type": "key",
>    "id": "1",
>    "name": "keyDown",
>    "code": "a"},
> {"type": "key",
>   "id": "1",
>   "name": "keyUp",
>   "code": "a"}
> ]

In the example I'm working from (which was from an older version of the
spec), the property named "type" here is called "source". After a quick
look through the latest draft spec, I don't see either name defined, so we
should pick one. I'm not fussed which.

> Should any of the fields be optional e.g. should it be OK to send an
> action without an id, and have the remote end use an implicit id for
> this undefined case? That would almost always be the right thing for
> keyboards, for example. If that doesn't happen, and the id format is
> "any string" it seems likely that local ends are going to have to send a
> uuid id as a default (to prevent it clashing with any later user-defined
> ids).

I think this should be dependent on the source type. I'm not sure how an
implicit id would work for multi-touch actions, but the keyboard case
sounds reasonable to me.
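
To make the keyboard case concrete, here's a sketch of how a local end
might default a missing "id" to a uuid, as suggested above. The function
name and dict shape are mine, not from the spec:

```python
import uuid

def with_default_id(source):
    """Return a copy of an input-source dict, adding a uuid "id" if absent.

    A uuid avoids clashing with any user-chosen ids sent later.
    """
    if "id" not in source:
        source = dict(source, id=str(uuid.uuid4()))
    return source

src = with_default_id({"type": "key",
                       "actions": [{"name": "keyDown", "code": "a"}]})
```

An explicitly supplied id would pass through untouched.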

> Is it intended that the full payload is validated before any actions are
> taken? The current spec is specifically written in a way where the
> actions will partially complete if there is an entry half-way in with
> the wrong format, but I suspect this is an oversight.

In my current implementation, we attempt to do some validation on the whole
input before sending it to the browser. We really should specify that
remote ends do not fail half-way through an action chain, as this would
leave simulated input devices in an unknown state (e.g. we wouldn't know
which fingers are pressed and released).
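
Roughly, the validate-then-dispatch shape I have in mind looks like the
sketch below, assuming the per-source object form from the earlier example
({"type", "id", "actions"}). The field checks and error type are
illustrative only:

```python
def validate_sources(sources):
    """Collect (source index, message) pairs for every problem found."""
    errors = []
    for i, source in enumerate(sources):
        if source.get("type") not in ("key", "pointer"):
            errors.append((i, "unknown type"))
        for j, action in enumerate(source.get("actions", [])):
            if "name" not in action:
                errors.append((i, "action %d missing name" % j))
    return errors

def perform(sources, dispatch):
    # Reject the whole chain up front, so no action partially completes
    # and simulated devices are never left in an unknown state.
    errors = validate_sources(sources)
    if errors:
        raise ValueError("invalid action chain: %r" % errors)
    for source in sources:
        for action in source.get("actions", []):
            dispatch(source, action)
```

The point is just that dispatch only begins once the entire payload has
been checked.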

> == General Semantics ==
> The assumption of the current specification seems to be that for actions
> that produce internal state, that state can persist for longer than a
> single API call, and that all such state is removed by the DELETE
> endpoint. However it's not clear to me if there's a usecase of this, or
> what the semantics are of releasing state (this also applies to the
> alternate scenario where state is released at the end of each API call).
> Consider for example:
> [
> [{keyDown a}],
> [{keyDown b}]
> ]
> When the state is released, does this work like sending {keyUp a} and
> {keyUp b} actions? Which order do such implicit actions occur in? Or is
> the idea that you just purge internal state without having any other
> effect? This latter option seems problematic if the browser or content
> assumes that it will always get matched pairs of certain events.

I thought the use case was to allow local ends to cancel a failed action
sequence. It was decided that local ends are allowed to break up sequences
to send separately (for performance reasons?) so this could be required to
support the case where the local end needs to reset the state after a
failed sequence.
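
For what it's worth, one possible reading of "releasing state" is to emit
a matching up event for everything still pressed, in reverse press order.
This is only a sketch of that interpretation, not behaviour the spec
currently defines:

```python
def release_state(pressed, dispatch):
    """Release held inputs in reverse order of pressing.

    pressed: list of (kind, code) pairs, e.g. ("key", "a"), oldest first.
    """
    for kind, code in reversed(pressed):
        # Emit the matched pair event, e.g. keyDown "a" -> keyUp "a".
        dispatch({"name": kind + "Up", "code": code})
    pressed.clear()
```

Under this reading, the {keyDown a}, {keyDown b} example would release as
{keyUp b} then {keyUp a}, so content always sees matched pairs.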

There could be more context that I've missed though, as I wasn't around
when the first versions of this spec were written.

> == Pause Action / Temporal Ordering ==
> It is unclear to me how the temporal ordering is supposed to work in
> general. I assume that simultaneous actions are intended to happen
> left-to-right, top-to-bottom so that given
> [
> [{a}, {b}],
> [{c}, {d}]
> ]
> the order of starting each action would be a,b,c,d.

The top-level list is a list of "input sources", and each input source has
a sub-list containing an "action sequence". So if your example is a
multi-touch action, that would mean that "a" and "b" are performed on the
same finger, as are "c" and "d". This means that "a" and "c" are performed
"simultaneously", followed by "b" and "d" during the next tick.
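
The ordering described above can be sketched as a simple transposition of
the per-source sequences into ticks (names are mine, for illustration):

```python
def ticks(sources):
    """sources: one action sequence per input source.

    Yields, per tick, the n-th action of every source that still has one;
    actions within a tick are "simultaneous".
    """
    length = max(len(s) for s in sources)
    for n in range(length):
        yield [s[n] for s in sources if n < len(s)]

# For [["a", "b"], ["c", "d"]] the ticks are ["a", "c"] then ["b", "d"].
```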

> However it seems
> that the pause action can take a duration. What use cases is this
> supposed to cover, and when exactly do things happen?

The spec talks about "ticks", and in Chrome we've chosen a tick to mean
1/60 of a second. The pause duration allows clients to extend a tick in
order to perform an action more slowly.
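
That is, something like the following, where a tick lasts as long as the
longest pause requested within it, falling back to the 1/60 s default
(units and field names are illustrative; the spec's pause units may
differ):

```python
DEFAULT_TICK = 1 / 60  # seconds; the value we picked in Chrome

def tick_duration(actions_in_tick):
    """Longest requested pause in this tick, or the default tick length."""
    pauses = [a["duration"] for a in actions_in_tick
              if a.get("name") == "pause" and "duration" in a]
    return max(pauses + [DEFAULT_TICK])
```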

> e.g. if I have
> (assuming pauses are measured in s for brevity):
> [
> [{keyDown a}, {pause 1},         {keyUp a}]
> [{pause 2},   {pointerMove 10 20}, {pause 3}]
> ]
> Should I expect the behaviour to be press down a, 2s pause,
> instantaneous pointer move to 10,20, 1 second pause, lift a, 3 second
> pause? Or are the events supposed to happen at some other point relative
> to the pause (e.g. in the middle of the tick?). Should events like
> mouseMove be "smeared out" over the tick duration somehow (e.g. a linear
> interpolation of the position firing an event every 16ms, or using
> requestAnimationFrame, or whatever).

We're implementing this by dispatching all of the events at the start of
the tick/duration, so this would match your interpretation of "press down
a, 2s pause, ...". For ease-of-implementation reasons, I'd prefer not to
have to interpolate the event timing here, but are there use cases that
require this?

> == Elements ==
> It seems that some actions (keys, events) are supposed to be relative to
> an element, but there isn't anything in the protocol to specify which
> element. So, for those actions how does one supply the element?

Section talks about getting the "element id" (see steps 3-9).
There's similar prose for keyboard actions. Is this what you're asking
about?

> == Key Actions ==
> Fundamentally I am unclear what key event model people want to
> standardise. I have seen lots of conversations around specific keyboard
> layouts and IMEs and so on. At the same time many platforms now don't
> present physical keyboards, and the kind of interaction you get from
> something like Swype doesn't seem possible to model in the current
> specification. I think interoperability is possible through a model in
> which key actions generate (well-specified) DOM events, and
> above-browser parts of the system (compose key, soft keyboard, IME,
> etc.) are abstracted away. Is there a strong reason that this simple
> model is not good enough?
> Is it expected that the keyboard model has key repetition e.g. if I do
> [[{keyDown a}, {pause 10}, {keyUp a}]]
> when the focus is on an input control, how many "a" characters should I
> see?

I'm going to avoid this for now, since I've mostly been working on pointer
actions recently. But I agree that this is underspecified and needs more
thought put into it.

> == Pointer Actions ==
> It seems like pointer actions are always specified relative to an
> element? Is this correct, or should it also be possible to specify
> relative to the viewport?

I thought we were using viewport-relative coordinates, but after re-reading
section it seems like this is relative to either the element or
the document. Did this change? Anyone know why?

> There is an open issue about dispatching touch events and other kinds of
> events. How will this be handled?

Currently the "pointer" actions are implemented as either "mouse" or
"touch" or whatever else is suitable for the platform. I think having a
common "pointer" input source makes sense for web apps that are trying to
be agnostic to the type of input they are receiving, but it isn't good
enough for tests that want to precisely specify the input source.

The spec currently requires that WebDriver implementations support
"keyboard" and "pointer". Implementations could support other input
sources, although it would be hard to write interoperable tests that prefer
non-standard input sources.

For example, maybe it would be nice if we could specify an input source as
"touch, pointer" to mean touch, but fall back to pointer if the
implementation doesn't support touch. Does anyone think this would be a bad
idea?

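Concretely, the fallback could be as simple as the remote end picking the
first preferred type it supports. Purely illustrative; nothing like this
is in the spec today:

```python
def pick_source_type(preferences, supported):
    """preferences: comma-separated string like "touch, pointer".

    Returns the first preferred type the remote end supports, else None.
    """
    for wanted in preferences.split(","):
        wanted = wanted.strip()
        if wanted in supported:
            return wanted
    return None

# pick_source_type("touch, pointer", {"pointer", "key"}) falls back to
# "pointer" when touch is unavailable.
```
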
> There is probably a lot more to clarify, but that seems like a
> reasonable start…

Received on Wednesday, 20 April 2016 19:04:29 UTC