Minutes of the WEBRTC WG Virtual Interim
December 2, 2020
Henrik Bostrom, Scribe

Insertable Streams Raw Media (Harald)

Harald reminds the audience of the three stages of the Breakout Box presented at previous Virtual Interims.
Stage three is to allow generating and consuming tracks directly.

Harald: Guido started implementing stage three directly because it was not much more difficult than stage two.
ProcessingMediaStreamTrack can be shimmed in O(10) lines of code. We therefore suggest adding stage two as an example rather than as an API surface. Is that OK?
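
For illustration (not from the slides): a rough sketch of such a shim, assuming the names used in the Chromium Breakout Box experiment (MediaStreamTrackProcessor, MediaStreamTrackGenerator); the exact constructor shapes are illustrative, not normative.

    // Shim a stage-two "processing track" on top of the stage-three pieces:
    // read frames from the input track, run them through a TransformStream,
    // and emit them on a generated track.
    function createProcessedTrack(inputTrack, transformer) {
      const processor = new MediaStreamTrackProcessor({ track: inputTrack });
      const generator = new MediaStreamTrackGenerator({ kind: inputTrack.kind });
      processor.readable
        .pipeThrough(new TransformStream(transformer))
        .pipeTo(generator.writable);
      return generator; // a MediaStreamTrack carrying the transformed frames
    }

Here transformer is the usual { transform(frame, controller) { … } } object, which should close each incoming frame once it is done with it.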

Bernard, Henrik: Yes.

Harald: Should we specify a face-tracking API?

Youenn: Being able to define metadata for this may be useful. Maybe limit it to local tracks?

Harald: We could add metadata to the frame, which may belong in WebCodecs.

Tim: This is useful for Funny Hats because I don’t see how you can address Funny Hats without face tracking, so I don’t think they’re separable.

Youenn: It’s not just face tracking, it could be more than that (other examples mentioned like “Funny Tongue”).

Tim: The risk is that if we don’t start somewhere we end up not doing face tracking properly. I think we need to address it.

Chris: There are use cases for Breakout Box that don't involve face tracking, and there are use cases in WebCodecs without face tracking.

Dom: It sounds separable enough from Media Capture and Streams that it could be addressed as a separate API, and there is already a
face tracking API in WICG:
https://wicg.github.io/shape-detection-api/

Harald: Should an example be added to the Shape Detection API (WICG spec)? Then we can discuss whether we need to do more here.
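
For reference, the Shape Detection API exposes a FaceDetector interface; a minimal sketch of using it (availability and the accepted input types vary by implementation, so treat this as illustrative):

    // Detect faces in an ImageBitmap-like source; each detection exposes a
    // boundingBox (and optionally landmarks) that a Funny Hats style
    // transform could use to place overlays.
    const faceDetector = new FaceDetector({ fastMode: true, maxDetectedFaces: 1 });

    async function findFaces(imageBitmap) {
      const faces = await faceDetector.detect(imageBitmap);
      return faces.map(face => face.boundingBox);
    }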

Harald: Insertable Streams Raw Media status: an experimental implementation is available in Chromium and the spec work has started.

Takeaways:

Breakout Box stage two will be a shimmable example in the spec rather than a standalone API. Harald is working on the spec. Guido has a working implementation.

It’s not clear if this working group should define a Face Tracking API or if this work should be handled by WebCodecs or Shape Detection API (WICG). But it is clear that there is interest.

WebRTC Insertable Streams Encoded Media as a transform (Youenn)

Safari supports the SFrame transform, JS transforms, and combinations of the two. A JS transform runs on a background thread by default. One attribute of type RTCRtpTransform was added to RTCRtpSender and RTCRtpReceiver. See example JS and WebIDL in the slides.

Youenn illustrates that the transform can be used not just alone but combined with other transforms using pipeThrough(). So you can add JS-specific transforms and combine these with browser-implemented encryption. Conclusion: Combos are working fine!
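
The slide code is not reproduced here; the following is a sketch of what such a combination might look like on the worker side (the rtctransform event name, the transformer shape and the SFrameTransform constructor are taken from the proposal as understood and are illustrative):

    // Worker side: combine an application-specific JS transform with the
    // browser-implemented SFrame encryption using pipeThrough().
    onrtctransform = (event) => {
      const transformer = event.transformer;
      const sframe = new SFrameTransform(); // key set elsewhere via setEncryptionKey()
      const munge = new TransformStream({
        transform(encodedFrame, controller) {
          // application-specific processing of the encoded frame goes here
          controller.enqueue(encodedFrame);
        }
      });
      transformer.readable
        .pipeThrough(munge)   // JS-specific transform
        .pipeThrough(sframe)  // then SFrame encryption
        .pipeTo(transformer.writable);
    };

On the main thread the worker would be attached with something like sender.transform = new RTCRtpScriptTransform(worker, options).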

Youenn: Can we update the insertable streams API to use the transform model? The answer is yes, using WHATWG TransformStream.

Questions to the working group:

Add SFrameTransform to insertable streams draft spec?

Update insertable streams draft spec to transform model?

Harald: Can the transform be modified, or is it set once? We discovered that it is important to handle the first frame, and that streams are weak for reconnection.
Secondly, it is important that the stream be bidirectional (to allow feedback mechanisms), and I worry this makes it unidirectional.
Example: telling the encoder to go down in resolution.

Youenn: In the current API there is a method for requesting a key frame.
That’s one way we could implement the feedback, at least for keyframes. This even works from the receiver side all the way to the encoder.
I am not aware of other cases, but I don’t see why we couldn’t expand with additional APIs to talk to the encoder when needed.

Tim: What if the transform changes the bitrate requirements?

Jan-Ivar: From an API point of view I think this is an improvement; with the previous API you had two ends of a cable.
The transform stream makes the intent clearer and lessens the risk of wiring up the cables incorrectly.

Harald: But if this is only a temporary stage, and we actually do want to break things apart, then the transform model doesn’t make sense.

Youenn: It’s difficult to judge the two approaches without clarifying what type of signaling/feedback mechanism is needed.

Harald: Action item on myself to specify what I think we should do next.

Youenn: But is there support for adding SFrameTransform?

Jan-Ivar: I support it.

Bernard: I don’t support this before considering use cases beyond E2E encryption. There are use cases (such as the addition of substantial metadata)
that cannot be supported today because they could break congestion control. Support for signaling/feedback mechanisms might enable those
use cases, which cannot be represented in a transform model.

Jan-Ivar: This constrains some of the use cases we are trying to solve, and solves some of the use cases we originally attempted to solve.

Youenn: The known use cases are supported. What other use cases are needed?

Bernard: The Virtual Reality Gaming use case:
https://w3c.github.io/webrtc-nv-use-cases/#vr*

In this use case, the feedback model comes into play (since the metadata can be substantial), so it is important to solve that first.

Jan-Ivar: But the original API doesn’t support that either?

Bernard: Not yet.

Youenn: The other use cases are unrelated to senders and receivers.

Tim: I like this. With the original API I wondered whether it was a transform or not; this API answers that question for me.

Jan-Ivar: The original problem we wanted to solve was encryption, and this solves it elegantly.

Guido: I support SFrameTransform, except that the key is exposed to JavaScript, which is a problem. Is there a way to use it without exposing the key to JS?

Youenn: The crypto key could either be extractable or non-extractable.
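
A sketch of that point: WebCrypto keys can be generated as non-extractable, so the raw key bytes are never visible to JS. The key algorithm and the setEncryptionKey() usage shown here are assumptions made for illustration.

    async function setupNonExtractableKey(sframeTransform) {
      // Generate a key whose material cannot be exported to JavaScript.
      const key = await crypto.subtle.generateKey(
        { name: "AES-CTR", length: 128 },
        /* extractable */ false,
        ["encrypt", "decrypt"]
      );
      // Hand the opaque CryptoKey to the browser-side transform.
      await sframeTransform.setEncryptionKey(key);
    }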

Bernard: We’re over time. Should we follow up on the mailing list?

Harald: I think we have rough consensus for defining an SFrame transform in some API, but we don’t have consensus on how to do it.

Youenn: Let’s follow up when Harald has defined use cases.

Takeaways:

Youenn/Safari have a working implementation of SFrameTransform and proposed that the Insertable Streams spec be updated to
1) add SFrameTransform for encryption, and
2) switch to using the transform model for all transforms.

There is consensus on having an SFrame transform, but consensus was not reached on how to do it, specifically whether we should go in the direction of TransformStreams or keep the “both ends of the wire fully exposed” model that has previously been implemented.

More discussion is needed after Harald has defined use cases, trying to answer the question of what type of feedback mechanisms/signaling is needed.

getCapabilities (Jan-Ivar)

Jan-Ivar presents slides we did not get to in the TPAC joint meeting with PING.
Privacy issues have been filed regarding leaking HW capabilities without permission in getCapabilities() as well as in getStats().
However, most of this information is already available in the SDP needed to set up a peer connection, as well as in other APIs.

Jan-Ivar: We already have fingerprinting notices in the spec about this.
The Graphics Hardware Fingerprinting document notes that this information is already available in WebGPU, WebGL and the Performance API.
So addressing this in WebRTC alone does not solve the problem.

Proposal: Add a note relating to the implementation status of hardware permissions.

No objections.

Allow piggybacking getCapabilities() on most recent Offer or Answer (Henrik)

Henrik: I’ve prepared two slides trying to clarify what requirements getCapabilities() has to fulfil, because the spec is quite unclear about what it needs to return, what it does not need to return, and whether this can change over time.
If so, it could break the intended use cases, or its usefulness could be implementation-specific.

But because we are already over time, I volunteer to skip my slides. To be revisited at the next interim.

Media Capabilities API extended to WebRTC (Johannes)

Johannes is trying to solve the use case of selecting an efficient encoder/decoder configuration before the stream starts.
We have getCapabilities() today, but it only says which codecs are available and does not give any expected performance for the codecs returned.

There already exists a Media Capabilities API that can be queried for “supported”, “smooth” or “powerEfficient”.

Johannes: I propose we extend the Media Capabilities API to support WebRTC.

Add webrtc as MediaDecodingType and MediaEncodingType.

Add scalabilityMode to VideoConfiguration to query for SVC support.

Clarify MIME types for WebRTC.

Example JS for this is on a slide.
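
The slide code is not reproduced in the minutes; a sketch of what such a query might look like under the proposal (the "webrtc" type and scalabilityMode are the proposed additions; the MIME type and numbers are illustrative):

    const configuration = {
      type: "webrtc",                // proposed MediaEncodingType
      video: {
        contentType: "video/VP9",    // exact WebRTC MIME types still to be clarified
        width: 1280,
        height: 720,
        bitrate: 1500000,
        framerate: 30,
        scalabilityMode: "L1T3"      // proposed SVC extension
      }
    };

    navigator.mediaCapabilities.encodingInfo(configuration).then(result => {
      console.log(result.supported, result.smooth, result.powerEfficient);
    });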

Youenn: I like this; we see use cases where WebRTC is used for decoding only.
So having an API like Media Capabilities as a future replacement for getCapabilities() would be good; its model is better and it is asynchronous.
It is also easier to handle fingerprinting.

Jan-Ivar: I like this. Is this enough to deprecate getCapabilities()?

Henrik: No because getCapabilities() also returns RTP header extension capabilities.

Henrik: How would the browser know if something is “smooth” or “power efficient”?
WebRTC could be used in very different use cases (a single stream, a 50-person meeting).

Youenn: That’s true but this is much better than what we have today.

Chris: This was previously discussed in Media Capabilities. We found that we shouldn’t be too worried about the exact numbers; what matters more is relative terms.
For example, if a SW encoder is comparable in performance to a HW encoder, we say it’s efficient.
I would be happy to expand on this more formally, but the spec is very vague on this point.

Johannes: There hasn’t been much concern with regard to privacy yet. The existing API does not have any concerns as far as I know.
We could reference the hardware permissions problem discussed previously (the conclusion to the getCapabilities slides).

Chris: When we discussed this the assumption was that most devices fall into general buckets.
For example most MacBooks look like most other MacBooks and so there shouldn’t be too much of a fingerprinting surface.
But there hasn’t been a PING review yet.

Jan-Ivar: I think our working group is OK with this. Hopefully a privacy review will go well.

Dom: It would be useful to know how this would be used in combination with WebRTC,
and whether there would be any significant changes to user interaction, for example if a prompt were added in the future.

Takeaways:

Strong support to extend Media Capabilities for WebRTC.

Authors may need to ask for a PING review.

Expose camera presets (Youenn)

Web pages tend to prefer camera presets, but have difficulty selecting them.

Proposal: Add a more straightforward API, similar to how it is done in the OS by native applications.
To avoid fingerprinting issues, this is put at the MediaStreamTrack level (after camera permission has been granted).

Example slide: getUserMedia() + get track.presets + do track.applyConstraints(presets).
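
A hypothetical sketch of that flow; track.presets and the field names of a preset are part of the proposal, not an existing API:

    async function openCameraWithPreset() {
      const stream = await navigator.mediaDevices.getUserMedia({ video: true });
      const [track] = stream.getVideoTracks();

      // e.g. pick the preset with the highest frame rate (field names assumed)
      const preset = track.presets
        .slice()
        .sort((a, b) => b.frameRate - a.frameRate)[0];

      await track.applyConstraints({
        width: preset.width,
        height: preset.height,
        frameRate: preset.frameRate
      });
      return track;
    }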

Youenn: A bonus of this is that we could use mock presets to create a mock capturer and enable WPTs for constraints processing!

Henrik: I like this, but I have a concern about doing this in two steps (getUserMedia first, applyConstraints later).
Wouldn’t this imply opening, closing and re-opening the camera, causing visible glitches?

Youenn: Possibly, but there are ways to mitigate this; for example, you could delay the opening of the camera to a task after resolving the promise, giving the application time to call applyConstraints() inside the resolved promise.
Then you only have to open the camera once.

Jan-Ivar: Why not just pick an ideal resolution close to the one you want, so you end up with a native resolution?

Youenn: Yes but what about frame rate? At least on macOS the frame rate support on different resolutions/cameras can vary widely.

Jan-Ivar: Could we expand resizeMode:”none” to also find native frame rates?
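
The constraints-based alternative Jan-Ivar describes might look like this (values illustrative); resizeMode: "none" asks the browser not to downscale, so a native camera mode is selected:

    navigator.mediaDevices.getUserMedia({
      video: {
        width: { ideal: 1280 },
        height: { ideal: 720 },
        frameRate: { ideal: 60 },
        resizeMode: "none"
      }
    }).then(stream => {
      // the track now reflects a natively supported resolution/frame rate
    });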

We had to cut the discussion short because of time. Takeaways:

It would be possible to expose a camera’s presets after the permission prompt (making the information useless for trackers) and without closing and re-opening the camera.

Presets could also be used to create mock capturers and make constraints processing testable in Web Platform Tests.

Discussion was cut short before consensus was reached.
Let’s follow up next interim?

WebRTC-NV Use Cases (Tim)

Tim has created a PR and is looking for feedback: are there any of the following use cases that we don’t want to include as an NV use case?

There are two categories: existing use cases and new use cases.
Existing use cases nearly work in WebRTC 1.0 or work inefficiently, but they’re not great.
These include:

* Improve remote UX for pre-empted media calls
* Extend IoT use-case to work in isolated networks
* Non-discriminatory FunnyHats
* Add local in-browser SFU/MCU for conferencing

New use cases include:

* Decentralized web over datachannel for Matrix, |pipe| etc
* Low latency P2P broadcast for Auctions, Betting etc.
* Reduced Complexity Signalling


Tim: Improve UX for pre-empted media calls on mobile. The local user knows what happened when a GSM call is received, but the remote user doesn’t know why it froze.
We should add a use case covering a way to “park” a connection and recover it later. Today the call is dropped completely.

Tim: Extended IoT use-case to work in isolated networks.
IoT often communicates via LAN (e.g. baby monitor), it should be possible to reconnect even if the internet connection or signaling server is down.

Youenn: Couldn’t the device have an HTTP server to do the signaling?

Tim: These devices rarely have up-to-date certificates for HTTPS, and there are issues with establishing ICE connections.

Tim: Face tracking / non-discriminatory FunnyHats. But this was discussed earlier.

Tim: Browser SFU/MCU - Cloud conferencing. A group member can encrypt and send copies of the encoded media directly to multiple group members.

Bernard: Are you requiring the browser to receive simulcast as part of this? That’s complicated. What is the requirement?

Tim: Not necessarily. What I want to achieve is being able to mix the received streams of all participants (e.g. Meet’s layout) and send out a mixed stream to all participants.
The problem is that if I do this today (e.g. canvas capture and send to every participant), I have to encode it once per participant, which is inefficient.
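
For reference, that workaround might look roughly like this (peerConnections is assumed to hold one RTCPeerConnection per participant):

    // Mix participants onto a canvas, capture it, and add the resulting track
    // to every peer connection; each connection then encodes the same mixed
    // video independently, which is the inefficiency Tim describes.
    const canvas = document.querySelector("canvas");
    const mixedStream = canvas.captureStream(30); // 30 fps, illustrative
    const [mixedTrack] = mixedStream.getVideoTracks();

    for (const pc of peerConnections) {
      pc.addTrack(mixedTrack, mixedStream);
    }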

Tim: Decentralized internet. The ability to intercept the fetch API and service it over a P2P link, e.g. a data channel in a service worker.
All of these use cases kind of work already, but inefficiently and in an ugly way.

Tim: Low latency P2P broadcast. Mass broadcast. Autoplay does not work well. DRM over data channels, the ability to reuse subtitle assets.

Bernard: DRM is supported over data channels if you send containerized media over the data channel and render it using Media Source Extensions.

Tim: Reduced complexity in signaling. A URI format to define the remaining transport-related fields.

Tim: Are there any of these that we don’t want to talk about?

Harald: I see them all as valid use cases.

Bernard: Me too.

We had to cut the discussion short because of time.
