Re: Frustrations of captions video delivery

Hi Silvia, all,

thanks for raising this. I recognise many of the same frustrations. I'd add that browser folk do not seem motivated to revisit the topic, and I wonder if the complex patchwork of differing global legal requirements adds inertia.

tl;dr: Three main problems, and there's inadequate pressure for change:

  1.  Native support for formats
  2.  Balance between privacy and usability
  3.  Lack of usage data

Native behaviour in browsers (formats)
The BBC, the UK public service broadcaster, does not use native web rendering of subtitles at all: my conclusion is that it is simply not possible to provide a consistent accessible experience that way, as implemented today. The only generically re-usable construct in the current web standards, as I see it, is the ability to schedule close-to-accurate media-time triggers using Text Track Cues. That's not a great state of affairs. And as Dan Sparacio showed in the linked video, there is important functionality missing from the Text Track API, like the ability to remove text tracks.
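
To make that concrete, here is a minimal sketch of that one re-usable construct: a hidden track used purely as a timing mechanism. The element lookup and cue times are illustrative assumptions, not anyone's production code.

    // Use a hidden text track purely to schedule media-time triggers.
    const video = document.querySelector("video") as HTMLVideoElement;

    // "metadata" + "hidden" means the browser fires cue events without
    // painting anything itself; we do our own rendering in the handlers.
    const track = video.addTextTrack("metadata", "timing-triggers");
    track.mode = "hidden";

    // Schedule a trigger for media time 5s..7s.
    const cue = new VTTCue(5, 7, "");
    cue.onenter = () => {
      // Fires close to media time 5s; paint your own overlay here.
      console.log("entered at", video.currentTime);
    };
    cue.onexit = () => {
      console.log("exited at", video.currentTime);
    };
    track.addCue(cue);

    // Note: there is no API to remove the track created above - the closest
    // you can get is track.mode = "disabled" - which is exactly the kind of
    // missing functionality mentioned above.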

Possibly it is useful to consider two communities of content providers:

  1.  Individuals or small non-media-focused organisations uploading or hosting their own content, almost always complete videos rather than live streams.
  2.  Large, possibly commercial, organisations with complex delivery and distribution workflows that may include broadcast and/or apps where some content is delivered and streamed live, and where media is a core part of their output.

The first community is large and significant, and likely generates more video. For them, simplicity, resilience to syntax errors, and native support are important. For this community, the data model behind a "Cue" as a presentational object probably makes sense, but more likely still, they will use a service that does more of the work for them, like YouTube, and never touch the technical mechanisms under the hood.

The second community drives a lot of consumption of video. For them, commonality across platforms (broadcast, online), integration into enterprise workflows, B2B exchange, QC, lack of dependency on downstream providers, integration with MSE, consistency of user experience, usage reporting, customer relationship management, legal compliance etc. all need to be factored in, and they do care about the technical mechanisms. Formats matter to them. Even putting aside the question of XML vs other serialisations (which is important), the "Cue" model sounds initially appealing but, in the end, is no more closely related to the semantic content than the "snapshots of presentation" (ISD) model, and for live creation it is missing lifecycle features like "append text". The ISD model works well, though, at the possible expense of accurate language semantics in a live contribution environment. I don't think either is necessarily perfect, i.e. neither is what we might design from scratch if we knew everything we know today.

IMSC is a good fit for this second community, also taking into account the ecosystem around it. As we know, there is a complete absence of browser support. You can buy connected TVs and set top boxes everywhere that use HbbTV 2 or ATSC 3, and they have native players; similarly, Apple supports IMSC in HLS for mobile devices, and Google supports TTML in ExoPlayer for Android, but to display those subtitles on the web, you must use a polyfill. This feels like either an opportunity - maybe all subtitles should be rendered by polyfills, regardless of format - or a missed opportunity for convergence in browsers.
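
For a flavour of what the polyfill route looks like, here is a sketch using the open source imscJS library (npm package "imsc"); the function names follow its documented API, but treat the details as indicative rather than definitive:

    import * as imsc from "imsc"; // imscJS polyfill

    async function attachImsc(video: HTMLVideoElement,
                              overlay: HTMLDivElement,
                              url: string) {
      // Parse the IMSC/TTML document.
      const doc = imsc.fromXML(await (await fetch(url)).text());
      // Media times at which the presentation changes.
      const events: number[] = doc.getMediaTimeEvents();

      video.addEventListener("timeupdate", () => {
        // Find the latest event at or before the current media time...
        const t = events.filter((e) => e <= video.currentTime).pop();
        if (t === undefined) return;
        overlay.innerHTML = "";
        // ...then generate that "snapshot of presentation" (ISD) and paint it.
        imsc.renderHTML(imsc.generateISD(doc, t), overlay);
      });
    }

In a real player you would schedule those event times via the hidden-track trigger mechanism sketched earlier, rather than polling timeupdate, which fires too coarsely for accurate subtitle timing.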

Privacy vs usability
The other major missing feature is a consistent model for managing customisation of subtitle and caption presentation. This is deliberate, but problematic.

Browsers implement (as per spec) the strict privacy requirement that web applications can never observe system-level user customisation choices. So polyfills cannot see, and therefore cannot apply, user preferences, unless a page manages its own local copy of those preferences. This is what we have to do. It results in poor consistency of user experience between websites, which is an accessibility problem, and it doesn't actually solve the privacy problem, because pages that implement their own preference mechanism can obviously report those preferences back to their own origin.
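
As a minimal sketch of that page-local workaround (all names here are hypothetical), each site ends up with something like:

    // Each origin keeps its own copy of preferences the platform won't share.
    interface SubtitlePrefs {
      fontSizePx: number;
      textColor: string;
      backgroundColor: string;
    }

    const DEFAULTS: SubtitlePrefs = {
      fontSizePx: 90,
      textColor: "#ffffff",
      backgroundColor: "rgba(0, 0, 0, 0.8)",
    };

    function loadPrefs(): SubtitlePrefs {
      // Per-origin storage: the user has to configure every site separately,
      // which is exactly the inconsistency problem described above.
      const raw = localStorage.getItem("subtitlePrefs");
      return raw ? { ...DEFAULTS, ...JSON.parse(raw) } : DEFAULTS;
    }

    function applyPrefs(overlay: HTMLElement, prefs: SubtitlePrefs): void {
      overlay.style.fontSize = `${prefs.fontSizePx}px`;
      overlay.style.color = prefs.textColor;
      overlay.style.backgroundColor = prefs.backgroundColor;
    }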

There's a lot of entropy in subtitle customisation preferences, so it's true that they're a potential fingerprinting vector - and one of many. I think another concern driving the privacy decision was that revealing an individual's use of assistive tech is personal information, because it reveals whether they have hearing difficulties. There's a case to be made, certainly based on UK usage behaviour, that the majority of subtitle users are not overcoming a hearing impairment, but have other reasons that create an environmental barrier to hearing. So I would say that particular reasoning no longer applies.

Lack of usage data
The lack of customisation data means that product folk cannot get usage data about how their content is being used (by inserting code into cue entry and exit handlers they can get some information about whether a track is active or not). That's an anti-pattern when it comes to product and service improvement.
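
That parenthetical is about the limit of what's observable. As a sketch (the /analytics endpoint and programmeId are hypothetical):

    // Cue enter/exit handlers only tell you that a track is being presented;
    // they say nothing about how the user has customised it.
    function instrumentTrack(track: TextTrack, programmeId: string): void {
      for (const cue of Array.from(track.cues ?? [])) {
        cue.addEventListener("enter", () => {
          navigator.sendBeacon(
            "/analytics",
            JSON.stringify({ programmeId, event: "cue-enter", t: Date.now() })
          );
        });
      }
    }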

If I'm a product manager of a service that provides video with subtitles, and my expectation, and default setting, is that the subtitles are 90px high, and 99% of users customise that to 30px, I really should change my defaults, but I can never discover that data. Similarly, if most of my programmes get 20% subtitle use, and there are some significant outliers that get 2% and some that get 70%, that's telling me something about my content that I could use to improve it for everyone. Or if nobody ever uses my French translation subtitles, say. Using the web approach to captions, it's hard to discover any of this, so I may not make those improvements. That's worse for everyone (particularly the French speakers!).

Note that the data does not have to be on an individual basis; aggregated and anonymous is fine here. But the web platform today has no sensible mechanism for providing aggregated, anonymised "private" data to content providers about their content.

I talked more about this in the TPAC breakout session https://www.w3.org/2022/Talks/TPAC/breakout-sessions/private-a11y/private_a11y_breakout.html

What should browser vendors do?

Putting this together, I'm not surprised that browser vendors are not motivated. Their big "customers" don't want to use their solution, or if they accept it, they also accept the inconsistent implementations; and their more numerous small customers aren't asking for change. Google themselves don't use the web approach to captions in YouTube.

Maybe it's time to re-open the conversation about the web approach to subtitles and captions and have another think about how it should work.

Nigel


________________________________
From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Sent: 07 January 2023 1:56 AM
To: Public TTWG List <public-tt@w3.org>; public-texttracks@w3.org <public-texttracks@w3.org>
Subject: Frustrations of captions video delivery

Hi all,

I thought it would be useful to share a video of a Paramount employee explaining their challenges with rendering and styling of videos via browser technology:
https://youtu.be/Z0HqYQqdErE

It's really quite frustrating to see that Web browsers are still not conformant with each other in how they render Cues on the Web, and therefore content providers still need to build their own rendering technologies to avoid being hit with lawsuits.

I stopped being active around caption standards and compatibility on the Web about 10 years ago, and I'm so sad that we haven't made much progress since then.

When will browser vendors make this a priority?

Regards,
Silvia.
