Re: Rationalizing new/start/end/mute/unmute/enabled/disabled from Stefan Håkansson LK on 2013-04-08 (public-media-capture@w3.org from April 2013)

From: Stefan Håkansson LK <stefan.lk.hakansson@ericsson.com>
Date: Mon, 8 Apr 2013 11:47:24 +0200
To: public-media-capture@w3.org
Message-ID: <5162922C.9080105@ericsson.com>
On 04/07/2013 05:45 AM, Randell Jesup wrote:
> We've been around this drain before... see 
> http://www.ietf.org/mail-archive/web/rtcweb/current/msg04624.html and 
> followups.  Perhaps we can make it to the exit this time.... :-)

Agree! It seems to me that you're drawing attention to two main topics:

1. The possibility to change the source of a MediaStream or a 
MediaStreamTrack. I can definitely see the logic in supporting this 
(though I have some comments inline on alternative approaches to support 
the use-cases). And now that we are incorporating functionality that a)
gives each source an id and b) allows for creating a MediaStreamTrack
first and then assigning a source to it, it does not seem very far-fetched.
But my questions would be: How soon can we have the details worked out? Do we
need it in the first version?

2. We should define how a saved stream (and perhaps other media files) 
can be converted to a MediaStream. Using the media element is one 
option, but would not meet the requirement of allowing the user to fool 
the application - something we have discussed we should support. That 
would require that a file can be picked by the user in the getUserMedia 
dialogue. But I have the same questions as above here!

Some more comments inline.
>
> On 4/4/2013 7:01 AM, Stefan Håkansson LK wrote:
>> On 4/3/13 5:29 PM, Randell Jesup wrote:
>>> On 3/25/2013 5:55 PM, Martin Thomson wrote:
>>>> I think that it's fair to say that the current state of the
>>>> MediaStream[Track] states is pretty dire.  At least from a usability
>>>> perspective.
>>>
>>> I generally agree with the approach here.  I also agree that the
>>> MediaStream should be an explicit rollup of the states of the tracks
>>> (just as I feel we need a stream.stop() in addition to track.stop(),
>>> even though you can build one in JS yourself).
>>>
>>> One thing I really want to see described (doesn't have to be in the
>>> spec) is how an application can provide a "Hold" operation where live
>>> video is replaced with a video slate and/or pre-recorded
>>> animation/audio, and do it without an offer/answer exchange. The
>>> MediaStream Processing API would have made this fairly simple, but since
>>> we don't have that, we need to define something.  WebAudio (if/when
>>> implemented) may help (with some pain) for the audio side, but doesn't
>>> help us with video.
>>>
>>> The real blocker here is giving a way for a MediaStreamTrack (in a
>>> MediaStream that's already AddStream()ed to a PeerConnection) to get a
>>> different source.  Currently, the only easy way I can see it is very
>>> kludgy and probably higher overhead/delay than we'd like:
>>>
>>>         video_hold.src = "my_hold_animation.ogg";
>>>         elevator_music = video_hold.captureStreamUntilEnded();
>>
>> What is this? An undocumented feature of the media element?
>
> It's a (very useful) API originally part of the Media Processing API 
> (you can look at the last draft of that).  Takes the output (decoded 
> audio and video) of a media element and uses it to source a 
> MediaStream.  We absolutely need it if we want any way to feed an 
> encoded/saved stream into a PeerConnection.  (We can record messages, 
> but we can't play them back except maybe through a canvas (ugh)).  We 
> can't even have a "Sorry, I'm not here right now, please leave a 
> message" without something like this.
>
> So we need it (or the equivalent) for all sorts of reasons, not just 
> "Hold"/Mute/etc.
I agree, and if we are to meet the requirement that the user should be
able to cheat the app into using a file instead of the camera as the source
in getUserMedia, going via a media element would not be sufficient. Do we
need this in version one?

>
> Firefox has had this since our MediaStream code landed most of a year ago.
>
>>
>>> The only alternative I see in the current spec might be to have two
>>> tracks always, and disable the live track and enable the Hold/slate
>>> track - but that would still cause a negotiationneeded I believe before
>>> it took effect.
>>
>> I will not argue that being able to switch source for a 
>> MediaStreamTrack is useless, because I think it could be useful. But 
>> switching source could very well also mean that a renegotiation is 
>> needed (I take this from what Cullen said in Boston: if the current 
>> source encodes with codec A but the other with codec B you'd have to 
>> renegotiate anyway).
>
> True, but not relevant.  If you need to negotiate, you do so.  And 
> I'll note that MediaStreams holding already-encoded data is I believe 
> under (or not) specified, other than people waving their hands and 
> saying "we'd like to have a camera that encodes and not have to decode 
> and re-encode it".  It explicitly doesn't define a canonical 
> representation - but it also doesn't specify anything related to that 
> or how things hooked up to MediaStreams can deal with incoming data, 
> which causes confusion in the current question.
>
> The closest it comes to defining this behavior is (in defining 
> "source"): "A source can be a physical webcam, microphone, local video 
> or audio file from the user's hard drive, network resource, or static 
> image."  Also "When a MediaStream 
> <http://dev.w3.org/2011/webrtc/editor/getusermedia.html#idl-def-MediaStream> 
> object is being generated from a local file (as opposed to a live 
> audio/video source), the user agent /SHOULD/ stream the data from the 
> file in real time, not all at once. ", and "User agents /MAY/ allow 
> users to use any media source, including pre-recorded media files" and 
> in discussion of implementation: "in addition to the ability to 
> substitute a video or audio source with local files and other media. A 
> file picker may be used to provide this functionality to the user."
>
> That's about all that's said about sources other than cameras.  We 
> really need to flesh this out - and decide what actually should be 
> required, or decide what is optional. 
> video_element.captureStreamUntilEnded() has the advantage of making it 
> possible to turn anything that can be a video source into a source for a 
> MediaStream (Media Source API, streaming video, etc).

Agreed, we need to flesh this out.

A question on video_element.captureStreamUntilEnded(): does it capture 
only what is rendered, or also tracks that are not played? And for the 
case of multiple audio tracks: those are mixed by the media element when 
played. Will those individual tracks be present in the captured 
MediaStream, or will there be just one audio track (representing the 
mixed audio)?

How well have you specified it? Is there text available that could be used?
>
>>
>> If I understand the use-case you describe here correctly, you'd like 
>> to switch to Muzak being played out at the remote end when you mute 
>> locally.
>>
>> The straightforward way of meeting this use-case with the current set 
>> of APIs would be:
>>
>> 1. As the application loads, download the Muzak source file
>>
>> 2. Use one video element to render the live a/v stream from the 
>> remote party, and a separate (muted) audio element that has Muzak as 
>> source - set up in a "loop" fashion
>>
>> 3. When the user at the remote end presses "mute me" in the 
>> application, have the app a) disable the audio track and b) send a 
>> "play Muzak" signal to the peer
>>
>> 4. When "play Muzak" is received, unmute the Muzak audio element 
>> (no need to mute the video element as silence is being played)
>>
>> 5. Same goes for unmute -> signal "stop play Muzak" -> mute the Muzak 
>> audio element.
>>
>> There are many other options as well.
>
> Ok, an application could do that, but that moves the Muzak from the 
> source to the target - that doesn't work if the target is a different 
> app, or the target is behind a gateway, or if what you want to play is 
> local to the sender.  I'll note I've done exactly this in the past to 
> indicate Video Mute (show a local slate to say the other side Muted - 
> that way users don't go "it's black - something's broken with the 
> video!" (and they do)).  However, I had to move to a source-side mute 
> once we had to deal with any type of non-homogenous network.
I'd say this is the main crux: do we need source-side Muzak for version
one, or is it enough to do it at the target? I think that, when designing
interoperable apps, you could agree on a way that makes target-side Muzak
work (you could send the Muzak file over the data channel), and I don't
see how gateways would be an issue.
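
To make the target-side variant concrete, here is a minimal sketch of the
receiving end, assuming the two signal strings from the steps quoted above.
The function name, the constants and the way the signal arrives are my own
assumptions for illustration; only the `muted` property of the media element
is a real API.

```javascript
// Signal names taken from the steps quoted above; the wire format is an
// assumption, not something any spec defines.
const PLAY_MUZAK = "play Muzak";
const STOP_MUZAK = "stop play Muzak";

// muzakElement stands in for a looping <audio> element loaded at startup;
// only its standard `muted` property is touched (hypothetical helper).
function handleMuzakSignal(muzakElement, signal) {
  if (signal === PLAY_MUZAK) {
    muzakElement.muted = false; // remote party muted: let the Muzak be heard
  } else if (signal === STOP_MUZAK) {
    muzakElement.muted = true; // remote party is back: silence the Muzak
  }
  // Any other signal is ignored, so unrelated app messages pass through.
  return muzakElement.muted;
}
```

The app would call this from whatever signalling or data channel handler it
already has; nothing here touches the PeerConnection.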

And if we got a file -> MediaStream interface (e.g.
video_element.captureStreamUntilEnded()) implemented, you could also do
source-side Muzak by setting up two audio tracks over the PeerConnection
(and enabling one while disabling the other).
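
A rough sketch of that two-track switch at the sender, assuming both tracks
are already part of a MediaStream added to the PeerConnection. The function
name is made up for illustration; the only real API used is the `enabled`
attribute of MediaStreamTrack. Whether toggling it avoids a renegotiation is
exactly the open question in this thread, so take this as a sketch only.

```javascript
// liveTrack and muzakTrack stand in for MediaStreamTrack objects; only the
// `enabled` attribute is used (hypothetical helper, not a proposed API).
function setHold(liveTrack, muzakTrack, onHold) {
  liveTrack.enabled = !onHold; // a disabled audio track is sent as silence
  muzakTrack.enabled = onHold; // the Muzak track is heard only while on hold
}
```

Calling setHold(live, muzak, true) would put the call on hold, and
setHold(live, muzak, false) would bring the live audio back.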

But the real problem would be legacy devices. If you have a legacy VoIP
device that can handle only one audio RTP stream, and that can't play a file
(that it can't get over a nonexistent data channel), there is no
solution except for switching source at the sender, or having some kind
of gateway to do magic tricks.

In principle I agree, being able to switch source of a
MediaStream(Track) would be natural to have (and needed for certain
legacy interop cases).

>
> And some apps will do it this way.  But there are plenty of reasons to 
> believe people will want to change the source of a MediaStream (or 
> MediaStreamTrack(s)).  Another random/silly-but-real example: if you 
> want to do Reindeer Antlers on someone's video image, you'll need to 
> change from direct getUserMedia()->PeerConnection to add a canvas 
> in between (or some such), and that means re-routing the data on the 
> fly, unless you set up the entire pipeline from the start "just 
> in case" it was needed (and, BTW, doing so would blow any attempt to 
> keep the data encoded - see above) - or you'd have to re-negotiate on 
> add and remove - and this gets more painful, since you'd need to drop 
> an old stream, and add a new one on each transition (building up 
> m-lines) or disable the old one and add/enable the new one maybe.
That holds if we end up with the one m-line per source solution, which is
still under debate I think. (Note that here there is also the possibility
to do the processing at the receiver, controlled by a signal from the
sender, but that would perhaps not be the natural way to do it.)

>
> Another example: I want (in my app) to be able to smoothly and 
> quickly switch between front and back cameras.  I don't want to have an 
> offer/answer exchange to do this.  Or switch mics.
(Here you could also have a MediaStream with two video tracks sent to 
the other end, and switch at the target. Maybe not the most natural way, 
but doable.)
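
As a sketch of the target-side switch, assume the sender signals the id of
the track it wants shown (front or back camera) and the receiver picks among
the video tracks of the incoming MediaStream. The helper name and the idea of
signalling a track id are assumptions for illustration; `id` is the
MediaStreamTrack attribute.

```javascript
// videoTracks stands in for the video tracks of the received MediaStream;
// wantedId is the track id carried in the (hypothetical) switch signal.
function pickVideoTrack(videoTracks, wantedId) {
  const match = videoTracks.find((track) => track.id === wantedId);
  // Fall back to the first track so that something is always rendered.
  return match || videoTracks[0] || null;
}
```

How the chosen track is then attached to a video element is left out here,
since that part depends on the rendering setup of the app.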

>
>>
>>>
>>> p.s. For Hold and Mute in a video context, I rarely if ever want 
>>> "black"
>>> as an application.  I may want to send information about Hold/Mute
>>> states to the receiver so the receiver can do something smart with the
>>> UI, but at a minimum I want to be able to provide a
>>> slate/audio/music/Muzak-version-of-Devo's-Whip-It (yes, I heard that in
>>> a Jack In The Box...)
>>
>> There is also the "poster" possibility with the video element (i.e. 
>> an image to be played in absence of video).
>
> Right; that's what I refer to as a 'slate'.  Sorry, video/cable 
> business dialect.  :-)
:-)

>
> -- 
> Randell Jesup
> randell-ietf@jesup.org
Received on Monday, 8 April 2013 09:47:47 UTC