Minutes from W3C M&E IG call: WebRTC for live media streaming

Dear all,

The minutes from the last Interest Group call on Tuesday 2nd October are available [1], and copied below. Many thanks to John and Peter for leading the discussion. John's introduction slide deck is here [2], and Peter's is here [3].

Our next call is Tuesday 6th November, following our TPAC F2F meeting. We'll announce details nearer the time.

Kind regards,

Chris (Co-chair, W3C Media & Entertainment Interest Group)

[1] https://www.w3.org/2018/10/02-me-minutes.html
[2] https://docs.google.com/presentation/d/1eDIUzwMeug_XAsRHOu-mzojBPWjlfOedodfkpKQ3tDk/edit
[3] https://docs.google.com/presentation/d/1_xQSoIdN-srjBc-GE_vuQMkxkaer2G-mZBchLPTyY20/edit

W3C
- DRAFT -

Media and Entertainment IG

02 Oct 2018

Attendees
        Present
                Kaz_Ashimura, Peter_Thatcher, Ali_C._Begen, Glenn_Deen, Chris_Needham, David_Waring, Francois_Daoust, Glenn_Goldstein, Hyojin_Song, Kasar_Masood, Kazuhiro_Hoya, Martin_Lasak, Masaru_Takechi, Nigel_Megitt, Peter_Pogrzeba, Matt_Paradis, Stephan_Steglich, Steve_Morris, Tatsuya_Igarashi, Wook_Hyun, Will_Law, John_Luther, Alec_Hendry, Barbara_Hochgesang, Chris_Poole, Adam_Roach, Giri_Mandyam

Regrets

Chair
        Chris, Igarashi

Scribe
        tidoust, cpn, kaz

Contents

Topics
        Introduction
        Web Video Lost a Feature in HTML5
        WebRTC for video streaming
        Q&A
        Peter's questions

Summary of Action Items

Summary of Resolutions

<tidoust> scribenick: tidoust

# Introduction
Chris: Low-latency distribution of media is a very important issue for media companies, so glad to have this call.
.... John will give an introduction to the business case, and then we'll go into the detail of WebRTC with Peter.

<cpn> scribenick: cpn

# Web Video Lost a Feature in HTML5

slides https://docs.google.com/presentation/d/1eDIUzwMeug_XAsRHOu-mzojBPWjlfOedodfkpKQ3tDk/edit

John: This is something I've been hearing about in the market over the last few years.
.... (Slide 1) Previously, web video was done using plug-ins.
.... Holes have been filled, with the <video> tag, MSE for adaptive streaming, EME for protected content, WebRTC for peer to peer.
.... Traditionally we've used RTMP in Flash, but this is still a gap on the Web.
.... (Slide 2) Not too many options to replace that for real time streaming,
.... in particular extremely low latency live streaming. Not like HLS or recent use of DASH, where there's latency due to segment caching.
.... These can still be 10-60 seconds behind live.
.... (Slide 3) I was at IBC recently, hearing about this a lot. Use cases include live sports, gaming (i.e., gambling on horse racing), e-sports, video game streaming.
.... Low latency is extremely important.
.... Breaking news, for when important events are happening.
.... And I heard about something unexpected. Someone at IBC mentioned monitoring of industrial manufacturing processes using RTMP.
.... He's worried about not being able to do that once Flash is gone from browsers.
.... (Slide 4) There are people doing low latency streaming in one way or another.
.... Some solutions are not supported across all browsers, or the latency isn't low enough.

<igarashi> we should separate the low latency requirement from the one-to-many(multicast) requirement

John: HTTP Chunked transfer with CMAF.
.... LHLS, used by Twitter with Periscope.
.... Multicast HTTP/QUIC, a BBC effort.
.... SRT is a proposal from Wowza, not implemented in browsers, so still requires additional client software.
.... A number of WebRTC based approaches. My understanding is it's difficult and costly to scale, because each user requires a direct connection to the server.
.... Project Stream from Google, for low latency gaming.
.... I'll hand over to Peter now.

Peter: I should mention that Project Stream is using WebRTC

# WebRTC for video streaming

slides https://docs.google.com/presentation/d/1_xQSoIdN-srjBc-GE_vuQMkxkaer2G-mZBchLPTyY20/edit

Peter: I know how WebRTC works, not so much about normal video distribution, I have some questions for you too, so please help me.
.... I'm coming at it from a super low latency distribution point of view.
.... (Slide 2) Questions: Can we use WebRTC for video streaming, does it scale, can we use content protection?
.... (Slide 3) We can, there are different approaches, some are more certain, some more speculative.
.... (Slide 4) WebRTC has a data channel. If you're using MSE you could use SCTP instead of HTTP. The JavaScript receives over SCTP, passes it to the MSE API
.... (Slide 5) Pros: Server push would be a lot easier, with DataChannel it's quite easy.
.... With really low latency, it would allow out-of-order.
.... Also need a congestion control algorithm.
.... Only available for TCP in the kernel. SCTP runs in user space, so could be easier.
.... Cons: you have to implement ICE, DTLS, and SCTP on your server. That's a lot of new stuff.
.... Don't need to use PeerConnection on the server, but it is needed on the client.
.... There are still issues with latency.
.... (Slide 6) Some things are coming that may help. BBR will eventually be available in SCTP.
.... There's a proposal in the WebRTC Working Group to add SCTP data channels independent of PeerConnection, and also available in Service Workers.
.... (Slide 7, 8) RTP receiver has a buffer that's highly tuned for low latency.
.... Packets go directly from the network to the buffer, not via JavaScript.
.... Low latency, 20ms of audio, one video frame.
.... Adapts to network conditions quickly, keeping a steady predicted bitrate.
.... I don't believe you'd need to re-encode to send RTP from your server.
.... Cons: You'll need RTP on the server, SDP and PeerConnection on the client
.... To keep the latency on the buffer low, it will speed up or slow down audio to increase/decrease the buffer delay.
.... In a normal video call, you probably don't notice. Your ears may not hear it with voice, but it could be more noticeable with music.
.... The buffer is good at working around gaps in the timeline. If there's a 40ms gap in the timeline, it can conceal it.
.... For larger gaps, you get a robot-like voice that doesn't sound good. But it's better than dropping to silence.
.... WebRTC has no concept of rewind or history. So if you want to pause to watch content delayed or timeshifted, there's no way to do that with WebRTC.
.... There's a feature in <video> to change the playback rate. It's not there with WebRTC.
.... You need to generate keyframes on demand on the server. The RTP client will send a signal to the server "I need a keyframe right now".
.... You need to be able to adapt the bitrate on demand. You can't send above that bitrate without introducing queueing.
.... (Slide 9) Some things are coming that may help: WebRTC without PeerConnection and control of the jitter buffer delay.
.... The Chrome implementation allows for this, but it's not exposed in the Web API, could be added as needed.
.... (Slide 10, 11) One thing that's easier with QUIC than SCTP, as it already has BBR, is bandwidth estimation to avoid queuing.
.... It's not in any browser yet; it's an editor's draft, being implemented in Chrome behind an opt-in flag.
.... (Slide 12, 13) Some speculative ideas, proposed in the Working Group, exposing a very low level decoder inside WebRTC. This would make the buffers inside MSE much more under your control.
.... If there was a low level decoder, you'd be able to control everything to your app-specific needs.
.... Cons: This doesn't exist yet, you'd have to write a JS / WASM library. And no-one has an idea of how EME or DRM would work into that.
.... (Slide 14, 15) Another speculative idea, can be done today, is writing codecs in WASM. Requires writing a lot of code.
.... Could be fine for audio, but a bigger issue for video.
.... Also no access to DRM using this approach.
.... (Slide 16) Technical gaps.
.... With SCTP and MSE, have difficulty putting this on your server.
.... RTP is difficult to implement on the server, audio acceleration / concealment, no rewind, also DRM.
.... (Slide 17) Does it scale?
.... If using the MSE approach, but replacing with a WebRTC transport, a limiting factor is adding ICE.
.... RTP is rather stateful.
.... You could parse the container formats you already have, and turn them into RTP packets.
.... And you need to be able to generate key frames on demand.
.... (Slide 18) ICE on the server.
.... We think of WebRTC as a peer to peer protocol.
.... ICE also works client/server. There's a mode, ICE Lite, that's easy to implement on a server.
.... All you have to do is ack some packets. It's a fairly simple thing, a couple hundred lines of code.
.... It needs a shared secret and some negotiation, share the secret among your servers.
.... On the other hand, QUIC may not require ICE.
.... (Slide 19) SRTP on the server.
.... Each server needs to know the SRTP crypto key and server parameters.
.... Divide the media into small chunks, 20ms for audio, single video frames.
.... (Slide 20) Nothing around WebRTC changes the multicast story. They all do a per-client crypto handshake, so packets can't be multicast.
.... RTP has a mode where the crypto key can be shared, RTP SDES. This mode has been banned by the WebRTC Working Group.
.... Chrome has it, and plans to remove it eventually.
.... If you're doing multicast you'll need a proxy or satellite server, do multicast to there, then per-client crypto from there to the client.
.... (Slide 21) What about content protection?
.... If using QUIC / SCTP DataChannel with MSE, there's no change.
.... But it's not possible with RTP.
.... (Slide 22) MSE vs RtpReceiver.
.... I read through the MSE code in Chromium and the MSE spec. I never used MSE, so don't know how it's used in real life.
.... What would I do to get it as low latency as possible?
.... RtpReceiver is the transport, buffer and decoder.
.... A typical HTTP / MSE implementation has similar structure.
.... Can you theoretically get the same latency from MSE?
.... (Slide 23) HTTP with TCP introduces head of line blocking issues.
.... With containerized media, you need to wait for chunks to build up, which adds latency.
.... Buffers can be increased up to 3000ms audio, 9 frames of video.
.... There's no way to say that you want to interpolate.
.... (Slide 24) How to make MSE better?
.... You could use QUIC. There's no way to push QUIC streams into the browser.
.... To work similar to WebRTC, you want to push frames into the MSE buffer, 20ms audio.
.... I believe you can do that, something hacky, per-frame WebM, would be nice if you could inject a single frame into the MSE buffer.
.... Would be nice if the buffer had controls for delay and interpolation.
.... (Slide 25) What's needed? RTCQuicTransport is in progress.
.... If we added an appendFrame method, we could add the h.264 or VP8 payload with timestamps.
.... Limits on how much audio to expect, 20ms to 3000ms, would be useful to adjust these.
.... Would be nice to set the interpolation behaviour, or allow the video to skip ahead. Keep the audio going, the video looks frozen, then resume.
.... I have questions for you in the IG, but I'll take some questions from you first.
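
The DataChannel plus MSE approach from slide 4 implies some glue code in the page: chunks arrive over the transport and the JavaScript passes them to the MSE buffer. The following is a minimal sketch of that hand-off, not something shown on the call. `AppendTarget` and `AppendQueue` are hypothetical names; `AppendTarget` is a stand-in for the part of an MSE `SourceBuffer` that the sketch relies on (`updating` and `appendBuffer()`).

```typescript
// Minimal structural stand-in for the part of an MSE SourceBuffer used here.
// A real SourceBuffer also fires "updateend" when an append completes.
interface AppendTarget {
  readonly updating: boolean;
  appendBuffer(data: Uint8Array): void;
}

// Queues chunks received from a transport and appends them one at a time,
// because a SourceBuffer rejects appendBuffer() while a previous append is
// still in progress.
class AppendQueue {
  private pending: Uint8Array[] = [];
  private target: AppendTarget;

  constructor(target: AppendTarget) {
    this.target = target;
  }

  // Call for every chunk that arrives from the network.
  push(chunk: Uint8Array): void {
    this.pending.push(chunk);
    this.flush();
  }

  // Call again whenever the target finishes an append ("updateend").
  flush(): void {
    if (this.target.updating) return;
    const next = this.pending.shift();
    if (next !== undefined) this.target.appendBuffer(next);
  }

  // Number of chunks still waiting to be appended.
  get queued(): number {
    return this.pending.length;
  }
}
```

In a browser, one plausible wiring would be to set the data channel's `binaryType` to `"arraybuffer"`, call `push()` from its `message` handler, and call `flush()` from the `SourceBuffer`'s `updateend` handler. A growing `queued` value is one sign that the page is falling behind, which relates to the later point that blocked JavaScript will stall the video.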

# Q&A

Nigel: You said that a con of RTP is that it doesn't offer rewind and history. I thought that was a feature of WebRTC.
.... Is there in general a way to specify a rewind point with WebRTC?

Peter: No, as soon as something is played, it moves on.
.... If you want the ability to rewind, it's something you'd have to give up if using an RTP receiver.
.... There are different ways to use WebRTC. If you take chunks from the DataChannel you can keep the rewind.

<Zakim> kaz, you wanted to ask if it's OK to distribute the slides of John and Peter to the MEIG public list (and add the link to the minutes)

Kaz: Thank you for a great presentation, John and Peter.
.... Can we share the slides publicly?

Peter: Yes

Kaz: I wonder if it's possible to synchronize multiple video streams and text captions using this?

Peter: WebRTC gives you tracks. If you put these in the same media stream, theoretically they'll be synchronized.
.... I know you can sync audio and video, but two videos in two separate <video> tags.

<kaz> scribenick: kaz

Chris: This is something we've looked at at the BBC, using WebRTC for a vision mixing application, using separate streams.
.... There was lack of synchronization for multiple video streams.

<cpn> https://www.bbc.co.uk/rd/blog/2017-07-compositing-mixing-video-browser

<cpn> https://www.bbc.co.uk/rd/blog/2017-07-synchronising-media-browser

<cpn> scribenick: cpn

Francois: Thank you for describing the QUIC and MSE solution. It strikes me that this could be the simplest solution from a media perspective: QUIC is coming, MSE is here, so combine these.
.... What's the standardization status of the QUIC API? Can the M&E IG help?

Peter: It's in a funny state. It started in the ORTC Working Group, the incubation group. Google and Microsoft have been heavily involved.

<kaz> QUIC API for WebRTC

Peter: We're implementing this inside of Chrome.
.... At the last F2F meeting, the WebRTC Working Group was undecided whether to adopt it inside the Working Group.
.... I came away with the action item to keep bringing this up. We're incubating in the ORTC Working Group for sure.

Francois: Process-wise, the M&E IG could voice its view, to support the work, with use cases.

Peter: That probably would be helpful. Working Group members who were least supportive were asking for use cases.

Francois: Sounds like a good action for the Interest Group.
.... I am interested in the interpolation behaviour idea for MSE. We're researching scenarios for synchronization, as Chris mentioned, between videos.
.... For example, an animation where if the video stops we don't care. It could be attached to the HTMLMediaElement, as a behaviour we want for regular video too, not necessarily specific to MSE.

Peter: That's an excellent point. You could probably do the same thing for the other methods I proposed.

Igarashi: Thank you for the presentation. I am wondering if there is any way to bind WebRTC with EME, using DataChannel,
.... so that encrypted media frames are decrypted using EME?

Peter: If you're using SRTP or QUIC, DataChannel could deliver encrypted media. If you set the keys, the media could then be decrypted.

Igarashi: I am wondering if you do this at a video or audio frame level (per-frame chunks)? So just using EME without MSE.

Peter: An EME implementation may have a lower bound on the chunk size.
.... I don't know if the EME mechanism would allow for chunks that small.

Igarashi: One issue with MSE is that frame handling should be handled by the UA, not the web app. That's behind my idea to use EME directly.

Peter: Yes, if the JS gets blocked, the video will stall.
.... It's a reason why this hasn't been adopted. It's not clear what the performance would be. We had an idea to use Worklets, like in Web Audio, also Houdini, the CSS pipeline.
.... If you're trying to do low latency, you do want to ensure the JavaScript doesn't pause.

Chris_Poole: I want to lend my support to the MSE based approach. It enables latency in the 4-5 second range.
.... For broadcasting, we don't need such ultra low latency. The idea to play through a lack of data, or conceal, these would be very helpful there.

<Will_Law> requests queue

Chris_Poole: Also, on per-frame chunks, with ISO BMFF you can put individual frames in chunks.
.... If you're aiming for compression efficiency and video quality, you do long-GOP encoding.
.... Generally people are doing large chunks.
.... But if you value low latency, you'll be doing forward prediction. Nothing stops you doing individual frames. With WebM or ISO BMFF you'll be getting the timestamps.
.... Is API support needed, or do you get that anyway?

Peter: I was wondering that, so thank you for confirming.

<Will_Law> sorry, my audio seems not to be working. Will dial in

Martin: A question we have here is about scalability. Imagine an event like the FIFA world championship, what challenges will we face? Can we have thousands or more concurrent users?

Peter: I don't see why not. For each client, you'll need a server that packetizes and encrypts the contents. So will need a lot of servers.
.... The closer the server is to the client geographically, the better. Works better with lower round-trip times.
.... There's nothing inherent to block scalability, other than needing lots of servers.

Martin: The servers would need to support the protocols you described.

Peter: Yes, the front end servers would need to get the content from somewhere, then packetize and send to clients.

Will: Back to the thread about MSE. I like the idea of QUIC getting data quickly to the client.
.... It seems to me that putting JavaScript in the way and using appendFrame doesn't seem like the way to go.
.... Can we hook up the MSE to a stream, and remove JavaScript from this?

Peter: Yes, a colleague of mine brought that up. I couldn't find appendStream in the code or the spec.
.... Where does the stream come from on the QUIC side? A WHATWG ReadableStream.
.... If you're doing something where you put all the media into a QUIC stream, we could go into how to map media into QUIC streams.
.... We have head-of-line blocking. But you wouldn't know what serialization there is. You could say this is one big chunk of WebM or MP4.
.... How are you going to take each individual stream and plug this down into MSE?

<tidoust> [FWIW, appendStream was removed from MSE before publication, because streams were not ready at the time, seems nothing is blocking re-introduction of appendStream now: https://github.com/w3c/media-source/issues/14]

Peter: What you're describing could work, but not if you're using many QUIC streams, or if you're doing something fancier over the wire using the QUIC transport.
.... It would only work in some specific scenarios, and not for everything.

<Zakim> tidoust, you wanted to wonder whether we can characterize "lots of servers" in comparison to current situation

Francois: Back to the scalability discussion, can we characterize the "lots of servers"? Today, if you use a DASH or HLS based infrastructure, would you need many more servers than that? How does it compare in terms of processing power?

Peter: I'm not familiar with those kinds of deployments, so it's hard for me to say.
.... It's not CPU intensive at all. You may not have the same crypto speed for SRTP encryption.

# Peter's questions

Peter: Is MSE widely adopted?

Chris: Yes absolutely. It's the building block for adaptive streaming. Libraries such as DASH.js build on top of it.

Peter: Seems like people are interested in interpolation, but appendFrame not so much.
.... How about liveness and delays? Maybe different heuristics.

Will: I think this is important. I'd like to see the ability to set a target latency. And then for the media element to adjust the playback rate.
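
One way to picture the target-latency idea is as a small controller that nudges the playback rate up when the player is too far behind live and down when it is too close. The sketch below is illustrative only and is not an existing media element feature. `LatencyControl`, `playbackRateFor`, and all tuning values are hypothetical, and how the current latency is measured is left to the application.

```typescript
// Hypothetical tuning parameters, chosen only for illustration.
interface LatencyControl {
  targetLatency: number; // seconds behind the live edge to aim for
  tolerance: number;     // seconds of error to ignore, to avoid constant changes
  maxRateChange: number; // largest deviation from 1.0, e.g. 0.05 for +/-5%
  gain: number;          // rate change per second of latency error
}

// Returns a playback rate that moves the current latency towards the target:
// faster than 1.0 when too far behind live, slower than 1.0 when too close.
function playbackRateFor(currentLatency: number, c: LatencyControl): number {
  const error = currentLatency - c.targetLatency;
  if (Math.abs(error) <= c.tolerance) return 1.0;
  const change = Math.max(
    -c.maxRateChange,
    Math.min(c.maxRateChange, error * c.gain),
  );
  return 1.0 + change;
}
```

A page could apply the result to the media element's `playbackRate` on a timer. Keeping `maxRateChange` small matters for the reason raised earlier in the call: speeding up or slowing down audio may go unnoticed with voice but can be more noticeable with music.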

Peter: I can talk to my coworkers who work on MSE, and tell them that people want better live stream support.

Igarashi: I'm interested in using EME directly, and concerned about JavaScript performance of appending frames.

Peter: What about the server protocols, is this a pain?

Will: I can answer on behalf of Akamai. We've deployed QUIC. It uses twice as much CPU per bit delivered. A cost issue, maybe also an optimisation issue.
.... We're in favour of QUIC, see no problems with ICE.

Peter: Next question is around the ability to alter the media, changing bitrate.

Martin: I think this is not very practical from a video delivery point of view. If you do DASH or HLS, you've decided on specific bitrates, the keyframes are kind of fixed. This would require re-encoding of the video which would be costly.

Peter: What kind of key-frame frequency is typical for live streams?

Martin: In HLS, recommendation is 10 seconds, 6, down to 2 seconds. Each segment starts with an I-frame, so you can join every 2nd second.
.... You can't join the stream at any random point.
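
The cost of a fixed keyframe interval when joining can be put as simple arithmetic. This is a rough sketch under the assumption that viewers arrive at uniformly random times within the interval: a new viewer either waits for the next keyframe or starts from the previous one and plays that much further behind live. `joinDelaySeconds` is a hypothetical helper name.

```typescript
// Worst-case and average delay attributable to the keyframe interval when a
// viewer can only start decoding at a keyframe, assuming arrival times are
// uniformly distributed across the interval.
function joinDelaySeconds(keyframeIntervalSeconds: number): {
  worst: number;
  average: number;
} {
  if (keyframeIntervalSeconds <= 0) {
    throw new RangeError("keyframe interval must be positive");
  }
  return {
    worst: keyframeIntervalSeconds,
    average: keyframeIntervalSeconds / 2,
  };
}
```

For the 2 second interval mentioned above this gives up to 2 seconds, 1 second on average. For a 10 second interval it gives up to 10 seconds, 5 on average. On-demand keyframes, as described for RTP, are aimed at avoiding this wait.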

Igarashi: To support quick channel-change, we should support 1 second.

Peter: About fixed bit rates, it's surprising that you want this. It's not going to work at all with live streams in varying network conditions.
.... What if the network goes down?

Martin: Of course, you adapt, but with different bitrates and adaptation sets. You design the DASH or HLS stream in advance to cope with different network conditions, and switch between them.
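
Switching between pre-encoded bitrates usually comes down to a selection rule on the client. The sketch below shows one simple, commonly used form of such a rule and is not any particular player's algorithm. `Rendition`, `selectRendition`, and the default safety factor of 0.8 are assumptions made for illustration.

```typescript
// One entry in a pre-encoded bitrate ladder.
interface Rendition {
  name: string;
  bitrateKbps: number;
}

// Picks the highest-bitrate rendition that fits within a fraction of the
// estimated throughput, falling back to the lowest rendition when none fit.
function selectRendition(
  ladder: Rendition[],
  throughputKbps: number,
  safetyFactor: number = 0.8,
): Rendition {
  if (ladder.length === 0) throw new Error("ladder must not be empty");
  const lowest = ladder.reduce((a, b) =>
    b.bitrateKbps < a.bitrateKbps ? b : a,
  );
  const budget = throughputKbps * safetyFactor;
  return ladder.reduce(
    (best, r) =>
      r.bitrateKbps <= budget && r.bitrateKbps > best.bitrateKbps ? r : best,
    lowest,
  );
}
```

The rule only says which rendition to aim for. When the switch can actually happen still depends on where the keyframes fall, which is the constraint raised in the discussion that follows.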

Peter: That would work fine with different network conditions. You can select the quality level, video conference systems work like this already.

Chris_Poole: Adaptation between bitrates, we have a few options for seamless changes, could be tricky with ultra-low latency, so you don't have much opportunity to change without introducing a glitch.
.... The other thing to note is that the potential for adaptation is normally linked to P-frame and I-frame locations. Today, you can't do that adaptation at any point in the stream. You have limited scope for doing something before needing error concealment.

Peter: There is the option of using scalable codecs. VP9 has different quality or resolution layers, so you can drop down without requiring a key frame. It's used in video conference systems for just this. Are any of you using this?

Will: No. In a video conference system, there's controlled endpoint hardware. There isn't a universally available solution we can work with.

Peter: It's in VP9, and also in AV1. Is it because you're in the h.264/h.265 world?

Will: Yes

<Zakim> kaz, you wanted to ask if we want to have more discussion with the WebRTC WG during TPAC

Kaz: Given the discussion here, should we continue this at TPAC?
.... Could you come to the M&E IG meeting on Monday?

Peter: I'll be at TPAC, happy to talk to people there, will be in Working Group meetings, happy to have dinner or lunch.

<kaz> Chris: Thanks and looking forward to seeing many of you at TPAC!

<kaz> [adjourned]

Summary of Action Items
Summary of Resolutions
[End of minutes]
Minutes formatted by David Booth's scribe.perl version 1.152 (CVS log)
$Date: 2018/10/04 14:51:33 $


Received on Friday, 5 October 2018 09:29:50 UTC