Re: Units of measurement and scrolling in actions


On Thu, Jan 12, 2017 at 4:08 PM, James Graham wrote:

> On 12/01/17 10:41, David Burns wrote:
>> tl;dr; The only size we can use is CSS pixels as that is what browsers
>> know. More answers inline.
>> David
>> On 12 January 2017 at 09:26, Simon Stewart wrote:
>> Hi,
>>> TL;DR: should units for size and distance be consistent throughout the
>>> spec? And if a scroll is needed when using Actions, how should that be
>>> specified, or is it implicit?
>>> Long version:
>>> While reviewing James's PR for pointer events, I noticed two things.
>>> 1/ We have no way of knowing the size of the viewport, or where an
>>> element
>>> is within the viewport.
> Except by executing javascript.

Is there a reliable and simple way to do that? Google Closure has a bunch
of code for figuring out this stuff:

I don't think there's a clean way of doing this from the local-end without
supplying scripts to do this work. That seems prone to failure.

Essentially, we need this information to accurately create an action
sequence which will execute as a user expects.
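
For reference, the script-based route mentioned above might look roughly like this sketch. It only builds the JSON body for an Execute Script command. The helper name `viewport_size_request` is hypothetical, and the choice of `document.documentElement.clientWidth`/`clientHeight` is an assumption that may not give the same answer in every browser, which is exactly the concern here.

```python
import json

# Assumption: size of the layout viewport excluding scrollbars. Other
# properties (for example window.innerWidth) can give different answers.
VIEWPORT_SCRIPT = (
    "var el = document.documentElement;"
    "return {width: el.clientWidth, height: el.clientHeight};"
)


def viewport_size_request():
    """Build the body of an Execute Script command (hypothetical helper)."""
    return {"script": VIEWPORT_SCRIPT, "args": []}


print(json.dumps(viewport_size_request(), indent=2))
```

Every local end would have to carry and maintain a script like this, which is the fragility being described.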

> 2/ We have no way of knowing the size of an element within the viewport in
>>> anything other than CSS reference pixels.
> I don't think there are other units that make sense.

I completely agree. We've discussed this on the IRC channel. My understanding
was that "clientX" was in "pixels", and these referred to pixels in the OS.
After the discussion, you, David, and I believe that they're CSS reference
pixels. That makes life a lot easier :)

> 3/ We have no text on how to handle the case where an element is outside
>>> of the viewport.
> In what situation? For pointer move that should return "move target out of
> bounds" with the new PR.

With the new PR, yes. That feels deeply unsatisfying for most of the
use-cases I listed. Worse, it'll appear flaky to people whose viewport
size is not consistent (e.g. local vs CI runs).
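
To illustrate the flakiness, here is a minimal sketch of a bounds check on a viewport-origin point. The function name is hypothetical, and it assumes (as a simplification) that the viewport fills the whole display, using the two screen sizes from use-case B.

```python
def move_target_in_bounds(x, y, viewport_width, viewport_height):
    """True if the viewport-origin point (x, y), in CSS pixels, is inside the viewport."""
    return 0 <= x <= viewport_width and 0 <= y <= viewport_height


target = (1200, 800)  # made-up pointer move target

# Same target, two environments (sizes taken from use-case B).
print(move_target_in_bounds(*target, 2880, 1800))  # True on the local display
print(move_target_in_bounds(*target, 1024, 768))   # False on the smaller CI screen
```

The same test passes locally and fails with "move target out of bounds" on the smaller screen.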

> In order to help give some context to the discussion, consider three
>>> separate use-cases:
>>> A/ A user expects "get element rect, calculate half the width, perform
>>> pointer move to element, perform second pointer move by that half width"
>>> to
>>> be the same as "get element rect, calculate half the width, perform a
>>> single pointer move with element and xoffset of the half width" to cause
>>> the pointer to end in the same place.
>>> B/ A series of interactions begins starting from element A and ending at
>>> element B, whose final x/y location is determined algorithmically and
>>> isn't
>>> known in advance. Until the interactions begin, element B is not within
>>> the viewport, and the size of the viewport is unknown --- on local test
>>> runs, the display is 2880 x 1800, but when running on a "webdriver as a
>>> service" provider, the screen size is 1024 x 768.
> Note that we currently resolve element coordinates only when an action is
> dispatched. So if you have a situation where you have multiple actions and
> the element is not in view until action M < N but is the target of a
> pointer move in action N, that all works fine.
> What doesn't work is the case where in a single pointer move action both
> causes an element to appear in the viewport and targets that element. But I
> think that's a totally reasonable restriction; if the browser doesn't know
> what the target is until the move starts, what should be the initial vector
> of move? To resolve the ambiguity the author must split a move into two
> parts, one with a manually specified target, and one with the element as
> the target.

Mid-way through preparing the action sequence, how does the local end know
that there needs to be two separate steps rather than one?

>>> C/ A user wants to start the pointer move in one frame, and end in
>>> another, performing a drag of (for example) an email into (for example) a
>>> folder of a web-based email app.
>>> Breaking these down, "a" and "2" show that we have a problem with the
>>> units used for specifying distances and sizes in webdriver. Most of the
>>> time, it's CSS reference pixels, but in Actions, we flip to using
>>> locations
>>> within viewports. We don't provide a mechanism to translate between the
>>> two. It would feel that consistently using CSS reference pixels
>>> throughout
>>> would be simpler for an end-user to understand, though more complex to
>>> implement at the remote end (since you now need to convert from reference
>>> pixels to a clientX/Y)
> I think you are misunderstanding the use of "CSS [reference] pixels". They
> are just a unit, more or less the foundational unit of layout on the web. A
> particular set of coordinates has both a unit and an origin; the underlying
> issue here seems to be the use of different coordinate origins in different
> parts of the WebDriver protocol.

See above. My problem wasn't with CSS pixels, but with the unit used for
pointerEvents. David points out that the specs don't necessarily make this
clear, but the implementations we looked at appear to use CSS pixels.

> However, I'm not sure whether "c" would complicate using css reference
>>> pixels: what if a user had changed the zoom level in one frame but not
>>> the
>>> other? Should we even allow drag motions between frames?
> CSS pixels get larger when the user zooms.

And this is the problem I was worried about. I think we're safe now.

> At TPAC Shenzhen we decided that this was not a use case (C) we were going
>> to support. Notes at
>> There are
>> possible security sandboxing issues. There is also the issue of doing an
>> implicit switch_to_frame to the new frame, doing the relevant lookup for
>> the element, and what to do if it's stale. When the Action Chain is
>> finished,
>> which frame do you end on? Since there is an implicit frame switch, people could
>> be expecting either case and this could lead to a footgun.
>>> It also seems clear that we need some mechanism to cause a scroll to
>>> happen mid-way through a series of (pointer) actions. We could do this
>>> implicitly (which would make "b" possible), by asking someone to specify
>>> a
>>> scroll action (from the null input device?), with a delta and an optional
>>> target element (which also makes "b" possible), or by returning some kind
>>> of error stating that scrolling would be necessary to complete the action
>>> (which may make "b" impossible).
>>> In Shenzhen, we said we didn't need such an API (however the new Actions
>> API
>> wasn't on the table at that point). We did, however, in SF
>> talk about scrolling
>> to
>> elements for different commands and how it would be good to turn this on
>> and off. Perhaps this needs to be either an Actions "task" or it needs to
>> be a property in the actions blob sent over the wire. I don't mind either
>> way.
> I am in favour of making scroll a primitive action rather than something
> implicit. Indeed the design of actions with an input type of "none" is
> specifically designed with this future extension in mind. I think this is
> necessary for use cases like infinite scrolling where the page is expected
> to dynamically resize. That said I think it should be a future extension
> and not something we do right now.

Since we have no implementations yet, and it provides useful functionality
to users, why not add it now? It's easy to spec out as we already talk
about scrolling elements into view (the complex case), and the underlying
implementations will need this primitive to work with the high-level
interactions commands.
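
As a purely hypothetical sketch of what such a primitive could look like on the wire: a "scroll" item with a delta and an optional target element, placed in a sequence whose input type is "none". None of the names below (`scroll_action`, `deltaX`, `deltaY`, `element`) come from the spec or the PR; they are assumptions for illustration only.

```python
def scroll_action(delta_x, delta_y, element_id=None):
    """Hypothetical 'scroll' action item; not defined by any spec text."""
    action = {"type": "scroll", "deltaX": delta_x, "deltaY": delta_y}
    if element_id is not None:
        # Optional target element, as suggested earlier in the thread.
        action["element"] = element_id
    return action


# A "none" input source carrying a pause followed by the hypothetical scroll.
sequence = {
    "type": "none",
    "id": "scroller",  # arbitrary id chosen for this example
    "actions": [
        {"type": "pause", "duration": 0},
        scroll_action(0, 600),
    ],
}
print(sequence)
```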

> My ideal outcome as a user would be:
>>> * All distances and sizes are always given in CSS reference pixels.
>>> * Scrolling happens thanks to a "scroll action" added to the events, or
>>> when a user specifies a target element in another action.
> So, I think what you mean here is "everything is in document-origin
> coordinates". I don't really see the advantage of this. For moving to an
> element the coordinate system isn't very relevant. For moving to a specific
> point viewport origin coordinates seem easier to reason about because you
> know that any location in the range (0->width,0->height) is a valid
> coordinate, without having to consider scroll position. This seems much
> more natural for the gestures use cases, and other things that don't
> involve interaction with specific elements, than document-origin
> coordinates where you always need to adjust for scroll position.

What I mean is that if a user gets the element rect, divides its width by
two, then performs a pointer move with the element as the target and that
value as an x offset, the pointer should end up at the edge of the element
regardless of zoom level.
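
A small sketch of that expectation (use-case A). The field names follow the pointerMove shape in the WebDriver spec as later published and may not match the PR under review. The `resolve` helper is hypothetical, the rect is made up, the scroll offset is assumed to be zero, and offsets from an element origin are assumed to be measured from its centre.

```python
ELEMENT_KEY = "element-6066-11e4-a52e-4f735466cecf"  # web element identifier


def pointer_move(x, y, origin):
    """Build a single pointerMove action item."""
    return {"type": "pointerMove", "duration": 0, "x": x, "y": y, "origin": origin}


def resolve(moves, element_centres, start=(0, 0)):
    """Simulate the final pointer position in viewport coordinates (CSS pixels).

    Assumption: offsets from an element origin are relative to its centre.
    """
    px, py = start
    for move in moves:
        origin = move["origin"]
        if origin == "viewport":
            px, py = move["x"], move["y"]
        elif origin == "pointer":
            px, py = px + move["x"], py + move["y"]
        else:
            cx, cy = element_centres[origin[ELEMENT_KEY]]
            px, py = cx + move["x"], cy + move["y"]
    return px, py


rect = {"x": 100, "y": 40, "width": 80, "height": 20}  # made-up element rect
centre = (rect["x"] + rect["width"] / 2, rect["y"] + rect["height"] / 2)
half = rect["width"] / 2
elem = {ELEMENT_KEY: "abc"}  # placeholder element reference

two_step = [pointer_move(0, 0, elem), pointer_move(half, 0, "pointer")]
one_step = [pointer_move(half, 0, elem)]

centres = {"abc": centre}
print(resolve(two_step, centres), resolve(one_step, centres))
```

Both sequences end at x = 180, the right edge of the made-up rect, which is the behaviour a user would expect as long as every quantity is in CSS pixels.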

I'd also like to be able to have our actions APIs provide a mechanism to
interact with elements outside of the viewport, possibly by bringing them
_into_ the viewport. This has to happen as part of the action sequence in
order to be efficient.

> A painful but possibly workable solution would be:
>>> * Provide a mechanism to get the current viewport size.
>> Within the Actions commands? Why can't we just use #executeScript for
>> this?

See above. I've yet to see this be consistent between browsers, and it
looks like Google's Closure (amongst others) has plenty of code to try and
deal with the differences. It seems remarkably error prone to ask all
local-end developers to be aware of these subtleties.

> * Provide a mechanism to get the size of the currently active frame in the
>>> viewport.
> I agree that there should be a way (not as part of actions) to get the
> size of the viewport. The fact that get window size includes window chrome
> seems to me to be a clear bug that I wouldn't put in the spec except for
> legacy compat concerns. In particular I can't see a use case where knowing
> the actual size of the os window, instead of the size of the content area,
> is important.

I'd be okay adding a new command to get the size of the viewport, or
piggy-backing it on the return value of Get Window Size.

> I'm not sure what the use case for the size of the frame is.

I'd like to execute Actions within a frame and add the required scroll
actions to the sequence to get everything to work.

> Again, where would we put this command and why can't we use #executeScript?
>> * Add additional properties to "get element rect" to return the client
>>> x/y/width/height of the element, assuming that it was scrolled into the
>>> current viewport.
>> It already returns that information. It doesn't return viewport positions,
>> unless you are using #executeScript and using the JS
>> element#getBoundingClientRect()
> Well what it returns at the moment is the position in document-origin
> coordinates. "The x,y assuming that it was scrolled into the document"
> doesn't make much sense, because an element could have many locations
> whilst being scrolled into the viewport.
> I agree that the primitive allowing viewport-origin coordinates might be
> an improvement.
> * Provide a scaling factor for converting between CSS reference pixels and
>>> client position
>> Historically, we haven't supported people changing the scaling in their app
>> and told them they need to fix it. See IEDriver as an example.
>> * Make local ends do the maths for users
>> Fine by me.
>> * Make scrolling explicit.
>> Fine by me
>>> The former seems simpler from a local end PoV, but I'm unsure how much
>>> work it would take at the remote end.
>>> I've come round to the idea scrolling should not be implicit, since it
>>> makes use case "c" a PITA to implement.
> Is there anything that wouldn't be fixed by making "Get Window Size"
> optionally exclude the browser chrome (i.e. return the size of the content
> area), Get Element Rect optionally return viewport coordinates, and
> eventually adding a scrolling primitive to actions?

I think that those would work, though we'd need to change the semantics of
"Get Window Size" to return the viewport size of the currently active frame
as well as the current top level browsing context's size including chrome.
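
The relationship between the two coordinate origins discussed above is simple arithmetic: subtract the scroll offset from the document-origin position. A minimal sketch, with a hypothetical helper name and made-up numbers, all in CSS pixels:

```python
def to_viewport_rect(rect, scroll_x, scroll_y):
    """Convert a document-origin rect to viewport-origin coordinates."""
    return {
        "x": rect["x"] - scroll_x,
        "y": rect["y"] - scroll_y,
        "width": rect["width"],    # sizes do not depend on the origin
        "height": rect["height"],
    }


doc_rect = {"x": 100, "y": 1200, "width": 80, "height": 20}  # made-up rect
print(to_viewport_rect(doc_rect, scroll_x=0, scroll_y=1000))
```

With the page scrolled down by 1000 pixels, the element at document y = 1200 sits at viewport y = 200.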

> I think these things are all possible and not too hard. I don't think any
> of them are high priority however.

My goal is to get Actions working reliably for people when they're using
WebDriver level 1.


Received on Thursday, 12 January 2017 14:30:19 UTC