Re: Super-academic, highly-abstract meta-modelling time: The Media Path

Hi,

this is a really good overview, and quite in line with my 
understanding/view of things.

A few comments:

* I like the division between "energy" and other constraints

* Is there really any point in having other than "min" constraints for 
energy? I mean, you'd want a minimum resolution, frame-rate, ... but 
what would the gain of setting a "max" be? (OTOH, max would be 
important in a network transport situation in order to not send bits 
that would have no value for the consumer at the other end of a 
PeerConnection - but this should be dealt with in the WebRTC WG)

* As far as mandatory and optional constraints go, I would be fine with 
removing the optional ones, but then we should standardize a way for 
(certain) consumer settings to travel upstream (up-track?) over a 
PeerConnection - and yes, that part is for another WG. I am primarily 
thinking about video width and height. Even in the case where the app 
has no wish to use mandatory constraints for this, the source should 
adapt to the display surface of the video element.

* Couldn't an "energy" constraint/setting also lead to an 
overconstrained situation? For example, if you set the minimum frame 
rate to 40 while the camera can produce a maximum of 30.
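To make that overconstrained case concrete, here is a minimal sketch; the capability/constraint shapes (`min`/`max` ranges keyed by name) are invented for illustration and are not the real API:

```javascript
// A minimal sketch of how an "energy" minimum can overconstrain a source.
// `capabilities` and `wanted` use invented shapes, not the real API.
function overconstrainedKeys(capabilities, wanted) {
  const bad = [];
  for (const [name, range] of Object.entries(wanted)) {
    const cap = capabilities[name];
    if (cap && range.min !== undefined && range.min > cap.max) {
      bad.push(name); // e.g. min frameRate 40 vs. a camera max of 30
    }
  }
  return bad;
}
```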

* (This is not related to your mail, but a reflection) Most constraints 
seem to be applicable both for source selection and for applying 
settings to already selected sources. But for source selection, by far 
the most usable constraint seems to be the direction of the camera 
(front/rear) - and this is(?) unusable in the settings, as the camera 
can't be moved from the front to the rear of the device.

* I share your fear regarding complexity. It is not only that we need to 
spec this up in a way that can be interpreted in only one way, and that 
implementers have to implement it all - we will also build a huge space 
for testing where we need to verify that different implementations 
behave the same (or at least similarly)

--Stefan

On 2013-02-15 02:05, Martin Thomson wrote:
> I was asked to more clearly elucidate my concerns with various
> constraint proposals that were in the throes of being proposed during
> the interim.  Had I been sufficiently forewarned, perhaps we could have
> avoided something of a lengthy discussion, but then we'd never have come
> to this email, which I think you will find is highly enlightening.
> Though the extent to which the enlightenment is relevant to the work
> of this task force may vary.
>
> I initially reacted poorly to the thought that constraints on tracks
> could imply that something would perform processing on those tracks.
>   That didn't fit the model I had...at the time.
>
> Here's the expanded model and how I think that constraints like aspect
> ratio can fit that model.  I've talked to Travis about this, and I think
> we agree on the high level points.  I believe that this is close to the
> model he used to build the settings proposals.  However, Travis hasn't
> seen this email yet and my first draft was totally incoherent.  So...
>
> (tl;dr version: see the picture)
>
> __*Actors*
> I think that we all agreed to this basic taxonomy:
>
>   * *Sources*: Camera, microphone, file, blob, RTCPeerConnection,
>     processing API, …
>   * *Connector*: MediaStreamTrack*
>   * *Sinks*: <video> tag, <audio> tag, RTCPeerConnection, processing
>     API, recording API, …
>
>
> Where cardinality is:
>      Source (1) .. (*) Connector (1) .. (*) Sink
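A hedged sketch of that cardinality as an object model (all class names are invented for illustration): one Source feeds many Connectors, and one Connector feeds many Sinks.

```javascript
// Illustrative sketch of the cardinality above (names invented):
// Source (1) .. (*) Connector (1) .. (*) Sink
class Source {
  constructor(kind) {
    this.kind = kind;
    this.connectors = []; // one source, many connectors
  }
}

class Connector {
  constructor(source) {
    this.source = source; // each connector has exactly one source
    this.sinks = [];      // one connector, many sinks
    source.connectors.push(this);
  }
  attach(sink) {
    this.sinks.push(sink);
  }
}
```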
>
> *__A Picture*
> 1000 words worth of goodness.  Apologies for the size.
>
> [inline image: diagram of the media path]
>
>
> *__Summary of Conclusions*
> (Thanks to Travis for these.)
>
>   * "Selection" is a process that evaluates changes in connector
>     constraints and attempts to choose a source, plus an operating mode
>     on the source
>   * Selection has an associated "scope" (of operating modes)
>       o If a connector doesn't have a source, then its scope doesn't
>         apply (it selects from the empty set)
>       o If a connector is in the process of acquiring a source (from
>         getUserMedia), its scope is all available [Note 1] operating
>         modes of all available sources
>       o If a connector has a source, its scope is limited to the
>         operating modes of its source (only)
>   * Selection with multiple connectors has a specific policy, depending
>     on constraint, one of:
>       o "greatest common energy [Note 2]" (the operating mode selected
>         is the one that maps to the highest energy need of the connected
>         sinks)
>       o "pick one only" (the operating mode cannot be in multiple
>         contradictory states, e.g., the fill light can't be both on and off)
>       o special - as defined by the constraint, some choices might not
>         be mutually exclusive (on and auto)
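The first two policies can be sketched as small merge functions; this is purely illustrative of the model, not an implementation of any spec'd behaviour:

```javascript
// Sketch of the two merge policies described above (illustrative only).

// "Greatest common energy": the source runs at the highest minimum
// that any connected track asks for.
function mergeEnergyMins(mins) {
  return Math.max(...mins);
}

// "Pick one only": contradictory values cannot coexist (e.g. the fill
// light cannot be both on and off), so conflicting requests fail.
function mergeExclusive(values) {
  const distinct = [...new Set(values)];
  if (distinct.length > 1) {
    throw new Error("conflicting settings: " + distinct.join(", "));
  }
  return distinct[0];
}
```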
>
> Note 1:  Sources (and operating modes) can be rendered unavailable by
> having other tracks connected to them.  Some operating modes are
> mutually exclusive.  For example, you can't have the fill light both on
> and off based on different constraints; or, an encoding camera might
> operate in a way that is not compatible with having multiple users of
> that data, so the browser simply disables sharing for that camera.  This
> can also happen through explicit action, but we need to define those
> special "constraints".
>
> Note 2: The concept of 'energy' needs further explanation.
>
> *__Energy*
> Energy is just a word that has little meaning in this context.  Energy
> == information, but in a qualitative fashion only.  The energy a source
> produces is the amount of information the track conveys.  A higher
> resolution contains more energy, a higher sample rate contains more energy.
>
> Processing can reduce energy safely, either by frame dropping, scaling
> down, cropping, adding black bars, etc...
>
> In contrast, scaling up doesn't add energy, it just pads with bits that
> contain no information.  Thus, certain sinks will be (and should be)
> unwilling to scale up.  For example, RTCPeerConnection doesn't want to
> send pointless extra stuff on the wire - it should be able to learn of
> the actual energy of the track and refuse to use anything that is scaled
> up, while scaling down as circumstances dictate.  If you want to scale
> up real-time video, scale it up on the receiver!  (BTW, don't infer new
> API requirements from this, this is purely internal browser-stuff.)
>
> When multiple tracks are attached to the same source, each might set a
> constraint on energy.  Any constraint that limits energy is ignored -
> for the purposes of selecting a source and operating mode.  Any
> constraint that imposes a minimum level of energy is used to determine
> which source and operating mode is selected.  The highest energy
> constraint from any track attached to a source is what determines its
> actual operating mode.
>
> For example, if track A wants 1080 lines minimum and track B wants 480
> lines minimum, track A wins and the camera produces 1080 lines.  If
> track B also wanted 480 lines /maximum/, then it will have to apply some
> processing to get that.
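The 1080/480 example can be worked through in a short sketch; the track shapes (`minLines`/`maxLines`) are invented here for illustration:

```javascript
// Sketch of the example above (track shapes are invented). The source
// operates at the highest minimum any attached track requests; a track
// with a lower maximum gets its copy scaled down in processing.
function deliver(tracks, sourceMaxLines) {
  const operating = Math.min(
    sourceMaxLines,
    Math.max(...tracks.map(t => t.minLines || 0))
  );
  return tracks.map(t => ({
    name: t.name,
    lines: t.maxLines !== undefined
      ? Math.min(operating, t.maxLines) // processing scales down
      : operating                       // no max: passed through as-is
  }));
}
```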
>
> Constraints/settings that follow this rule include resolution (height
> and/or width), frame or sample rate, bits per sample (if we could be
> bothered with this).  Minimum values are used to select sources or
> operating modes, maximum values are sent to the processing box.
> Cropping/letter-boxing are always processing instructions.
>
> *__Other Settings, Implications and Interactions*
> The other settings that we've seen (fillLightMode) directly affect
> sources.  These are easy: constraints can't specify conflicting values.
>
> However, this implies that the first track to apply a given setting
> determines the operating mode for the source.  As long as that track
> lives, its setting is the one that wins and other tracks are either
> unable to attach to the source, or unable to apply another setting.
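A sketch of that "first applied setting wins" behaviour for a source-level setting such as fillLightMode; the class and method names are invented, not a real API:

```javascript
// Sketch of "the first track to apply a setting wins" for a
// source-level setting like fillLightMode (names invented).
class ExclusiveSetting {
  constructor() {
    this.owner = null;
    this.value = undefined;
  }
  apply(track, value) {
    if (this.owner !== null && this.owner !== track && this.value !== value) {
      return false; // a live track already pinned a conflicting value
    }
    if (this.owner === null) this.owner = track;
    this.value = value;
    return true;
  }
  release(track) {
    // when the owning track ends, the setting becomes free again
    if (this.owner === track) {
      this.owner = null;
      this.value = undefined;
    }
  }
}
```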
>
> This is not ideal when settings interact.  We might manage as long as
> error feedback indicates that the error is due to there being other
> constraints on other tracks.  Or maybe we need to expose both the set of
> all possible modes along with the current set of possible modes, noting
> of course that the track that made the current setting could change it
> at any time.  That could result in an API that is a little hard to
> explain properly.  I don't have a good answer to this problem.
>
> In general, the model also implies that tracks don't report the actual
> "shape" of a track.  Tracks can report the settings that are currently
> in effect and any optional settings that could be.  But tracks cannot
> say that the video flowing inside is this or that resolution - it could
> change, and should be permitted to.  It might be OK to provide an
> indication of the current source operation mode, with a clear warning
> that this is volatile and not under direct application control.
>
> *__Double Processing*
> There are two places in the media path where processing logically
> occurs.  Implementations will naturally collapse those.  For instance,
> two lots of scaling can be reduced to a single scaling operation in most
> cases.  However, sometimes this will result in ugliness.
>
> The best example of ugly would be a 16:9 source that is sent through a
> 16:10 constrained track to a 16:9 video tag.  In that case, the correct
> thing to do is to display a nice black frame around the video, unless
> one of the aspect ratio changes cropped rather than black-barred, in
> which case...
>
> *__Example*
> We can apply this model to answer the important questions:
> What happens if you constrain/set a track to width=640,height=480 for a
> 1920x1080 camera source?
>
> If you consider the model, you reach two conclusions:
>
>   * the source only needs to provide 480 lines worth of energy, though
>     it may provide more - it could just pipe out 1920x1080 video
>   * the data that is provided to the sink is scaled and cropped (or
>     letter-boxed) to 640x480
>
> Adding another track (with no constraints) results in output of
> 1920x1080, depending on what limits are implicitly applied by its sink.
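The scale-and-crop (or letterbox) step in the second conclusion is just arithmetic; this sketch shows fitting a 1920x1080 frame into a 640x480 track under both strategies (purely illustrative):

```javascript
// Sketch: fitting a source frame into a constrained track. "letterbox"
// keeps the whole image and pads with black bars; "crop" fills the
// target and cuts away the excess. (Purely illustrative arithmetic.)
function fit(srcW, srcH, dstW, dstH, mode) {
  const scale = mode === "crop"
    ? Math.max(dstW / srcW, dstH / srcH)
    : Math.min(dstW / srcW, dstH / srcH);
  const w = Math.round(srcW * scale);
  const h = Math.round(srcH * scale);
  return { w, h, padX: Math.max(0, dstW - w), padY: Math.max(0, dstH - h) };
}
```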
>
> *__Make It Simpler, /Please/*
> One major thing we could do to simplify things is to dump the idea of
> mandatory vs. optional constraints.  This model supports a lot of
> flexibility without having "soft" constraints.  Anything you don't care
> that much about can be applied as settings after connecting the track.
>
> I can actually see how this model could be considered *way* too
> complicated as it is, without optional constraints.  It's already hard
> enough to implement.  More importantly, as a user of the API, it's very
> difficult to understand the model sufficiently well that you can choose
> optional constraints that produce sensible, or even predictable, outcomes.
>
> *__Render This All Moot*
> By allowing applications to gain access to information about sources and
> to connect sources to local playback sinks prior to gaining consent.
>
> (In the same vein: Harald did make a mildly convincing argument for
> allowing this after consent for one stream was granted, based on the
> premise that once you can grab an image using your camera, there isn't
> much left for fingerprinting to do.  That didn't account for very
> tightly controlled sources, or tainted streams, however, so I'm not sure
> we've reached that particular place just yet.)
>
> *__End Transmission*

Received on Friday, 15 February 2013 13:51:34 UTC