[webrtc-nv-use-cases] standardized support for captions/subtitles (#70)

dogben has just created a new issue for https://github.com/w3c/webrtc-nv-use-cases:

== standardized support for captions/subtitles ==
The [WebRTC Next Version Use Cases doc](https://w3c.github.io/webrtc-nv-use-cases/) lists three use cases under [Funny hats](https://w3c.github.io/webrtc-nv-use-cases/#funnyhats*) related to plain text associated with media: [Captioning, Transcription, and Language translation](https://github.com/w3c/webrtc-nv-use-cases/blob/f937a4fdf26d4b2dedefce371425a952be8ad05b/index.html#L279). However, no new requirements seem to be listed for handling the human-readable text generated or manipulated in these use cases.

Later, there is a requirement [N23: The user agent must be able to send data synchronized with audio and video.](https://github.com/w3c/webrtc-nv-use-cases/blob/f937a4fdf26d4b2dedefce371425a952be8ad05b/index.html#L664); however, I don't think that covers the support required for captioning, transcription, and language translation. The receiving side must be able to interpret the data as human-readable text, which implies that the format of the data should be standardized further.

I propose that the doc explicitly state that these use cases require sending/receiving human-readable text in parallel with other media, such that a received WebRTC stream connected directly to an HTMLMediaElement will have [textTracks](https://html.spec.whatwg.org/multipage/media.html#timed-text-tracks) representing the sent text. It should also state that the text tracks can be generated and processed in the same way as other raw media streams, as in requirement [N19: The application must be able to insert processed frames into the outgoing media path](https://github.com/w3c/webrtc-nv-use-cases/blob/f937a4fdf26d4b2dedefce371425a952be8ad05b/index.html#L644).
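To make that concrete, a receive-side sketch of what this proposal would enable might look like the following (hypothetical: it assumes the user agent surfaces the sent text as text tracks on the element, which is exactly the behavior that is not specified today):

```typescript
// Hypothetical receive side if the sent text arrived as text tracks.
declare const pc: RTCPeerConnection;
const videoElement = document.querySelector('video') as HTMLVideoElement;

pc.ontrack = (event) => {
  // Attach the remote stream directly to the media element.
  videoElement.srcObject = event.streams[0];
};

// Under the proposal, the user agent (not application code) would populate
// videoElement.textTracks with the received text, so the user keeps control
// over rendering, styling, and track selection.
videoElement.textTracks.addEventListener('addtrack', (event: TrackEvent) => {
  const track = event.track as TextTrack | null;
  if (track && (track.kind === 'captions' || track.kind === 'subtitles')) {
    track.mode = 'showing';
  }
});
```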

If this is not standardized, supporting these accessibility-enhancing features becomes much more difficult. Applications must invent a protocol for the text tracks and include code to encode/decode them to/from data channels, transforming them into calls to TextTrack.addCue. More likely, we'll see so-called "open captioning", where the text is rendered onto the video frames. Open captioning makes it impossible for users to adjust the format, size, location, etc. of the captions based on their needs, prevents the browser from automatically translating the captions into the user's language, and can cover or hide important information in the video. It also doesn't work well for users who have difficulty both hearing and seeing.
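To illustrate the data-channel workaround mentioned above, the per-application plumbing might look roughly like this sketch (the JSON message format, channel label, and timing fields are all invented for illustration):

```typescript
// Sender side: push each caption segment over a data channel in an ad-hoc format.
declare const senderPc: RTCPeerConnection;
const captionChannel = senderPc.createDataChannel('captions'); // label is arbitrary

function sendCaption(text: string, startTime: number, endTime: number) {
  captionChannel.send(JSON.stringify({ text, startTime, endTime }));
}

// Receiver side: decode the ad-hoc format and build TextTrack cues by hand.
declare const receiverPc: RTCPeerConnection;
declare const videoElement: HTMLVideoElement;
const track = videoElement.addTextTrack('captions', 'Captions', 'en');
track.mode = 'showing';

receiverPc.ondatachannel = (event) => {
  if (event.channel.label !== 'captions') return;
  event.channel.onmessage = (msg) => {
    const { text, startTime, endTime } = JSON.parse(msg.data);
    // Cue times are seconds on the media timeline; keeping them synchronized
    // with the incoming RTP media is left entirely to the application.
    track.addCue(new VTTCue(startTime, endTime, text));
  };
};
```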

Additionally, for the language translation use case, we should consider supporting the [kind](https://html.spec.whatwg.org/multipage/media.html#dom-audiotrack-kind) and [language](https://html.spec.whatwg.org/multipage/media.html#dom-audiotrack-language) categorizations for WebRTC audio and video tracks.
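As a rough sketch of why that metadata matters (the track shape below follows the HTML spec's kind/language/enabled attributes for AudioTrack; the selection logic and names are invented for illustration), a receiving application could then pick a translated audio track matching the user's preferred language without any out-of-band signaling:

```typescript
// Hypothetical: if remote WebRTC audio tracks carried the HTML spec's kind and
// language metadata, the receiver could select a translated track directly.
// A small structural type is declared here because browser support for
// AudioTrackList varies.
interface SpecAudioTrack { kind: string; language: string; enabled: boolean; }

function selectAudioLanguage(
  tracks: ArrayLike<SpecAudioTrack>,
  preferred: string = navigator.language
) {
  const base = preferred.split('-')[0];
  let chosen = -1;
  for (let i = 0; i < tracks.length; i++) {
    if (chosen < 0 && tracks[i].language.startsWith(base)) chosen = i;
  }
  if (chosen < 0 && tracks.length > 0) chosen = 0; // fall back to the first track
  for (let i = 0; i < tracks.length; i++) {
    tracks[i].enabled = i === chosen; // enable exactly one audio track
  }
}

// Usage (hypothetical): selectAudioLanguage(videoElement.audioTracks, 'fr');
```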

Please view or discuss this issue at https://github.com/w3c/webrtc-nv-use-cases/issues/70 using your GitHub account


-- 
Sent via github-notify-ml as configured in https://github.com/w3c/github-notify-ml-config

Received on Wednesday, 24 February 2021 16:25:46 UTC