[webrtc-nv-use-cases] Detailed example of potential value of A/V/data sync in WebRTC (#74)

darkvertex has just created a new issue for https://github.com/w3c/webrtc-nv-use-cases:

== Detailed example of potential value of A/V/data sync in WebRTC ==
[_I originally wrote this in the [W3C strategy repo](https://github.com/w3c/strategy/issues/133) and @dontcallmedom suggested I drop an issue here for greater visibility, so here goes..._]

I'd like to share a first-hand VR-related use case from my job where synced A/V/data could have been very useful:

My team needed to deliver N concurrent synced video feeds from a multi-lens VR camera rig at a location with poor computational capacity (too low for live-stitching panoramic video on-site). For reasons I cannot disclose, we needed to livestream 360 video with a VR camera that wasn't able to livestream a stitched 360 video natively out of the box. The workaround we settled on was to receive the individual feeds elsewhere on a more powerful computer and produce the stitched 360 monoscopic (i.e. non-3D) panoramic video there for streaming onward. (You can conceptualize each lens feed as an RTP/WebRTC video track.)

Our camera had 8 physical lenses arranged horizontally, but just 4 of them already gave us sufficient panoramic coverage, so to save some bandwidth we only sent 4... but which 4? It depends on what's visible near each lens; maybe we want the 4 even lenses, maybe the 4 odd ones. We designed the sending software to let you pick a subset of the cameras, and we can switch which ones are active on a whim, mid-stream.

One approach to dynamic feed switching in WebRTC could be to prenegotiate all the tracks you could possibly need and only send video on the ones you consider active, but then it's tricky to distinguish a feed that suddenly went inactive because it was intentionally disabled by a reconfiguration at the sender from one that went inactive because of network congestion or data loss down the pipe. Renegotiating WebRTC video tracks between configuration switches is possible, but we felt it interrupted the flow considerably and added overhead, so we didn't go with it.
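
To make the prenegotiation idea concrete, here's a minimal browser-side sketch (our actual sender was custom native software, so take this as an illustration only); `lensTracks` is a hypothetical array with one `MediaStreamTrack` per physical lens:

```ts
// Minimal sketch of the "prenegotiate everything, toggle what sends" idea,
// assuming a browser sender. `lensTracks` is a hypothetical array holding one
// MediaStreamTrack per physical lens; our real sender was custom software.
declare const lensTracks: MediaStreamTrack[];

const pc = new RTCPeerConnection();

// Prenegotiate one sendonly video transceiver per lens up front.
const transceivers = lensTracks.map((track) =>
  pc.addTransceiver(track, { direction: "sendonly" })
);

// Later, switch the active subset (e.g. the 4 even lenses) without renegotiating.
async function activateLenses(active: Set<number>): Promise<void> {
  for (const [i, t] of transceivers.entries()) {
    const params = t.sender.getParameters();
    // After negotiation each sender has at least one encoding entry.
    for (const enc of params.encodings) {
      enc.active = active.has(i);
    }
    await t.sender.setParameters(params);
  }
}

// e.g. activateLenses(new Set([0, 2, 4, 6]));
```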

We needed the camera configuration and identity metadata to be timestamped with the video frames so they could be correlated in perfect sync during reconfigurations. (Feed 1 may be showing camera 0, but five seconds later it may be showing camera 1, for example.) Identity matters for a realtime panoramic stitch because each lens is a different perspective in space and the algorithm must be kept informed or the result looks wonky. Unsynced changes are not useful, since they glitch the result of the 360 processing, and a WebRTC data channel (to my knowledge) cannot provide that sync with _today's_ WebRTC generation.
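
For illustration, the per-frame record we effectively needed looks something like this (field names are made up for this sketch, not taken from our implementation):

```ts
// Illustrative shape of the per-frame record we needed to keep in sync with
// the video; field names are made up for this sketch.
interface FrameCameraMetadata {
  feedIndex: number;    // logical feed/track (0..3 in our 4-feed setup)
  cameraId: number;     // physical lens currently routed to this feed (0..7)
  rtpTimestamp: number; // timestamp of the exact video frame this applies to
}
```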

We absolutely needed the camera identity to be in sync with the video frames. Since WebRTC data channels fell short, we simplified further and settled on a pure RTP approach. We opted to hijack the outgoing H.264 bitstream and inject SEI (Supplemental Enhancement Information) NAL units of payload type 5, aka "[unregistered user data](https://fuchsia.googlesource.com/third_party/ffmpeg/+/refs/tags/n4.3/libavcodec/h264_sei.h#34)", before the frame image data in the RTP video track. You can slip small amounts of user data (text or JSON or whatever) into the video feed this way without corrupting anything, and all video players safely ignore it.
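
To make that concrete, here's a rough sketch of building such an SEI NAL unit (the UUID is a made-up placeholder, the function name and metadata shape are illustrative, and this is not our production code):

```ts
// Sketch of building an H.264 SEI NAL unit (type 6) carrying an
// "unregistered user data" payload (SEI payload type 5). The UUID below is a
// made-up placeholder, not the one we used in production.
const METADATA_UUID = new Uint8Array([
  0x9a, 0x21, 0xf3, 0xbe, 0x31, 0x48, 0x4c, 0x11,
  0x89, 0x0f, 0x5d, 0x20, 0x42, 0x7e, 0x61, 0xc4,
]);

function buildUserDataSei(metadata: object): Uint8Array {
  const userData = new TextEncoder().encode(JSON.stringify(metadata));

  // sei_message: payload type 5, ff-coded payload size,
  // then 16-byte UUID + user data, then rbsp_trailing_bits (0x80).
  const payloadSize = METADATA_UUID.length + userData.length;
  const rbsp: number[] = [5];
  let size = payloadSize;
  while (size >= 255) { rbsp.push(255); size -= 255; }
  rbsp.push(size);
  rbsp.push(...METADATA_UUID, ...userData, 0x80);

  // Emulation prevention: 00 00 0x (x <= 3) must become 00 00 03 0x.
  const ebsp: number[] = [0x06]; // NAL header: nal_unit_type = 6 (SEI)
  let zeros = 0;
  for (const b of rbsp) {
    if (zeros >= 2 && b <= 3) { ebsp.push(3); zeros = 0; }
    ebsp.push(b);
    zeros = b === 0 ? zeros + 1 : 0;
  }
  return new Uint8Array(ebsp);
}

// Example: tag a frame with which physical lenses are currently active.
const sei = buildUserDataSei({ lenses: [0, 2, 4, 6], ts: 1234567 });
```

The sender then inserts this NAL unit ahead of the frame's slice NAL units; with RTP packetization (RFC 6184) the NAL units travel without Annex B start codes, so the packetizer takes care of framing.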

On a custom software receiver (not a regular browser) you can recover the H.264 payloads from the RTP track, reassemble the original NAL units, read the metadata ahead of each frame, and have your video processing react accordingly and instantly, since the metadata changes in perfect sync with the video frames.
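
And a matching receiver-side sketch (same placeholder UUID as above, equally illustrative): given the depacketized NAL units for one access unit, pull out the user-data SEI and decode the metadata for that frame:

```ts
// Sketch of the receiver side: from the NAL units recovered for one access
// unit, find our user-data SEI and hand back the decoded metadata so the
// stitcher can apply it to exactly that frame. Placeholder UUID as above.
const METADATA_UUID = new Uint8Array([
  0x9a, 0x21, 0xf3, 0xbe, 0x31, 0x48, 0x4c, 0x11,
  0x89, 0x0f, 0x5d, 0x20, 0x42, 0x7e, 0x61, 0xc4,
]);

function extractFrameMetadata(nalUnits: Uint8Array[]): object | null {
  for (const nal of nalUnits) {
    if ((nal[0] & 0x1f) !== 6) continue; // not an SEI NAL unit

    // Strip emulation prevention bytes (00 00 03 -> 00 00).
    const rbsp: number[] = [];
    for (let i = 1; i < nal.length; i++) {
      if (i >= 3 && nal[i] === 3 && nal[i - 1] === 0 && nal[i - 2] === 0) continue;
      rbsp.push(nal[i]);
    }

    // Walk sei_message(): ff-coded payload type, ff-coded payload size, payload.
    let pos = 0;
    while (pos < rbsp.length && rbsp[pos] !== 0x80) {
      let type = 0;
      while (rbsp[pos] === 255) { type += 255; pos++; }
      type += rbsp[pos++];
      let size = 0;
      while (rbsp[pos] === 255) { size += 255; pos++; }
      size += rbsp[pos++];
      const payload = rbsp.slice(pos, pos + size);
      pos += size;

      // Unregistered user data: 16-byte UUID, then our JSON blob.
      if (type === 5 && size > 16 && METADATA_UUID.every((b, j) => payload[j] === b)) {
        const text = new TextDecoder().decode(new Uint8Array(payload.slice(16)));
        return JSON.parse(text);
      }
    }
  }
  return null;
}
```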

---

If data channels could be kept in sync with video without codec gymnastics, or if another convenient mechanism existed for a generic timestamped metadata stream, I think we might have stuck with WebRTC for our use case. (I personally would have liked that, as it could have made it easier to debug things from a web app in-browser instead of some custom standalone software.)

Ultimately, data being in sync with video is important to any kind of "realtime actor" with a need for a status HUD, for example:
- imagine a first-person-view flying drone web app where you control the drone and a HUD overlay shows the live gyroscope data in perfect sync with the video,
- or one of those creepy walking robot dogs, with charts graphing the servo rotations overlaid on top, where you can see exactly when one of them jams because you know the data is in sync with the video showing you the same.

Sending device health and state information in perfect sync with the video feed is crucial for a trustworthy assessment of what's happening on screen at a remote entity. Being able to do this in an official and reliable capacity would be very exciting!

---

Sorry if I was a little verbose in my explanation. Hope it helps shed some light on why synced A/V/data could open the door to some very handy, exciting and useful in-browser use cases!


Please view or discuss this issue at https://github.com/w3c/webrtc-nv-use-cases/issues/74 using your GitHub account


-- 
Sent via github-notify-ml as configured in https://github.com/w3c/github-notify-ml-config

Received on Wednesday, 2 February 2022 18:01:11 UTC