Minutes from Media Timed Events Task Force call 15 June 2020

Dear all,

The minutes from today's Media Timed Events / WICG DataCue API call are now available [1], and copied below.

There were a few action items:

1. Kaz to publish the use case and requirements document as an updated IG Note
2. Rob and Chris to prepare a pull request for whatwg/html issue #5297
3. Chris to prepare a pull request for whatwg/html issue #5306
4. Chris to follow up with DASH-IF on alignment of API requirements
5. John and Chris to organise follow up with media companies

As mentioned, W3C members are welcome to join DASH-IF calls on this topic. The next call is Friday 26th June at 14:00 UTC. Please contact Chris for dial-in details.

Kind regards,

Chris (Co-chair, W3C Media & Entertainment Interest Group)

[1] https://www.w3.org/2020/06/15-me-minutes.html

W3C
- DRAFT -
Media Timed Events / DataCue API
15 Jun 2020
Agenda

Attendees

Present
  Kazuyuki_Ashimura, Chris_Needham, Francois_Daoust, Eric_Carlson, Gary_Katsevman, Nigel_Megitt, John_Fields, Rob_Smith

Regrets

Chair
  Chris

Scribe
  tidoust

Contents

Topics
  Status of publishing the IG note
  Text track cue event timing accuracy
  DataCue overall plan
  How should DataCue expose parsed vs unparsed data, or subsets of emsg data?
  Single vs multiple metadata tracks
  Add TextTrackCue end time representing end of media
  Next call

Summary of Action Items
Summary of Resolutions

<scribe> scribenick: cpn

# Status of publishing the IG note

Kaz: I've been busy elsewhere, so will work on it this week

Chris: Thank you!

<tidoust> scribe: tidoust

Chris: There are changes that we could make, but it's fine to leave it as it is. The section about synchronized rendering could feature the videoFrameCallback API, but given that we ran a call for consensus, we should just publish
.... This is coming up in the bullet chat topic, so we can consider that separately from this particular IG Note document.
.... In this group, we can focus much more on the DataCue API itself, and leave synchronized rendering aspects to the wider IG to follow up on.

# Text track cue event timing accuracy

https://github.com/whatwg/html/issues/5306 Issue #5306 in WHATWG

Chris: The proposal here is to add some wording in the Time marches on algorithm to say that there is an expectation that cues will be triggered ideally within 20ms of their position on the media timeline.
.... We discussed in the Media WG a couple of months ago whether this should go in MDN or in the spec itself. I believe it would be useful to capture it in the HTML spec itself, since that's the mother of all docs.
.... I have an open action to turn the wording in this issue into a pull request.
.... My understanding is that it would in effect reflect the current state of implementations, given the work in Chromium to improve the accuracy of cue events.
.... It might be worthwhile reviewing the issue that's open and the proposed wording that we put in.

Eric: That sounds fine to me, and I don't think there was any disagreement within the Media WG, so I think you can just move forward and finish it up.
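
As an illustrative sketch of what the proposed wording would mean in practice: the snippet below measures how far from its start time a cue actually fires, assuming a <video> element with id "video" whose first text track already contains cues (the element id and track index are assumptions).

  // Sketch: measure how late cue events fire relative to the cue's
  // position on the media timeline. Under the proposed wording, the
  // delta would ideally stay within 20ms.
  const video = document.getElementById('video'); // assumed element id
  const track = video.textTracks[0];              // assumed track index
  track.mode = 'hidden'; // fire cue events without rendering the cues

  track.addEventListener('cuechange', () => {
    for (let i = 0; i < track.activeCues.length; i++) {
      const cue = track.activeCues[i];
      const deltaMs = (video.currentTime - cue.startTime) * 1000;
      console.log('Cue fired ' + deltaMs.toFixed(1) + 'ms after its start');
    }
  });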

# DataCue overall plan

Chris: I have been discussing offline with John Simmons and members of the DASH-IF group. We're hoping that, by TPAC this year, we'll have enough implementation support in place that we can actually develop the API specification.
.... We really need to engage much more with implementers to make that happen. The plan that we discussed is, first of all, to make sure that what we're describing aligns with what DASH-IF is doing.
.... I believe that is the case, but we need to run a review to ensure that it is true.
.... Once we have that, we need to engage with media companies to make sure that we have captured all requirements.
.... We really need to be showing that media companies want the API, as it may not be a priority for some implementers.
.... So: finish the explainer, reach out to media companies, and in parallel invite people from Apple, Google, Microsoft, Mozilla... to make the case about the API.
.... I think that's what's preventing me from turning the explainer into a spec directly; I'd like to make sure that everyone is on board first.
.... There are no firm dates set for meetings with DASH-IF, but we'll make sure to advertise them so that you can join if interested.

Eric: That sounds fine. I agree that it may be a challenge to drum up interest from other browser vendors, but that is what it is.
.... Obviously, you need to be prepared for possible disagreement about particulars of the API, but that's just always true...

# How should DataCue expose parsed vs unparsed data, or subsets of emsg data?

<cpn> https://github.com/WICG/datacue/issues/21 WICG Issue 21

Chris: This is feedback from someone from Microsoft
.... This is one of the areas of disagreement that we keep coming back to since we started this work. We discussed it at TPAC last year.
.... Eric, I believe you argued to expose parsed data to the developer. I agree with that.
.... On the other hand, with the emsg box, applications may want to add additional parsers for binary messages in the media, and the question becomes: as new message types are invented and introduced, how do we know that an implementation supports parsing and presenting a particular message type in parsed form?
.... Some implementations may expose the message in its raw form while others may expose it as a parsed message.
.... We had some back and forth with the contributor from Microsoft, and ended up with a proposal that allows both options, and the user agent can choose which to use: either expose the parsed data (linked to some spec that describes the structure per cue type), or if it doesn't support the parsing of a particular cue type, then it could still expose the message as an ArrayBuffer field.
.... One of the implications is that, I think, web applications would always have to ship a parsing library to handle the second case.
.... Unless we can get to a situation where there is a core set of cue types supported across all implementations, I don't know how we can avoid that scenario.
.... I'm trying to get to the heart of one of these potential points of disagreement.
.... I've modified the interface definition of DataCue slightly to make the data and value properties nullable.
.... This would provide a migration path for HbbTV from unparsed data to parsed data.

Eric: Given that the proposed interface already allows value to be an array buffer, why would we want to have an extra field that is an array buffer?

Chris: I was thinking of an "either...or", not both.

Eric: In that case, why not use only "value"?
.... That is exactly what I do in WebKit now.
.... I don't know how to parse emsg, so I don't; I just put it as an ArrayBuffer in the "value" field

Chris: I think it makes sense to not have redundant fields. From an application perspective, you need a way to detect in which case you are.

Eric: Which you'll have to do in any case, since type is "any".
.... The comment that "data" is always null in WebKit is true. I didn't remove it because I didn't know whether there was any existing content that would assume that the property would be there.
.... I think, moving forward, that we should remove it.

Chris: I think the only hiccup is that HbbTV uses this field.
.... But with the next issue, we're already introducing breaking changes anyway for HbbTV.

Eric: That is also easy to polyfill.

Chris: True.
.... I'm happy to update the explainer and propose removal.

Eric: That makes sense to me.
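
To make the either/or model concrete, here is a sketch of the detection logic a web application might use under this proposal; the cue.type and cue.value shape follows the explainer's draft, and parseEmsgPayload and dispatchToApp are hypothetical application functions.

  // Sketch: the user agent delivers either parsed data or a raw
  // ArrayBuffer in cue.value; cue.type identifies the message scheme.
  function handleDataCue(cue) {
    if (cue.value instanceof ArrayBuffer) {
      // The UA doesn't parse this cue type, so the application must.
      // parseEmsgPayload is a hypothetical application-provided parser.
      dispatchToApp(cue.type, parseEmsgPayload(cue.type, cue.value));
    } else {
      // The UA has already parsed the message into a structured value.
      dispatchToApp(cue.type, cue.value);
    }
  }

As noted above, the first branch is why applications would always have to ship a parsing library unless a core set of cue types is supported across all implementations.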

# Single vs multiple metadata tracks

<cpn> https://github.com/WICG/datacue/issues/20

Chris: This is about using a single track model for cue types. HbbTV uses the in-band metadata track dispatch type field to identify what kind of event each track is carrying.
.... I think that's inconvenient from the application developer perspective. With that model, the application has to deal with multiple TextTracks.
.... What we're proposing is to consolidate all messages onto a single metadata TextTrack, and then the type information is carried in each individual cue.
.... My understanding is that this matches the WebKit model.

Eric: Yes.

Chris: There is a requirement that we captured from DASH-IF: they wanted to make receipt of particular cue types opt-in from the application's point of view.
.... In DASH, the manifest describes which types of events the player should expect to see, and the application subscribes to specific types (ID3 messages, manifest updates, etc.)
.... With the model that we're proposing as it stands, the user agent would expose all of the events that it supports to the application and the application would be responsible for filtering events it is interested in.
.... I need to get feedback from DASH-IF on whether this is a critical requirement, or whether they're happy to have it at the application level.

Eric: In an HLS stream, which can contain any number of types of metadata, how would an application know which types are in the stream, so that it can subscribe to those it is interested in?
.... If you don't have a manifest that describes what is in the stream, how do you handle the situation?

Chris: That's right; in the general case, we don't have a manifest.

Eric: That's right. I would argue that it is not difficult to set up an event listener and filter on the cue type. That does not create a lot of overhead. Think about mouse events, for instance, which fire far more frequently than the events envisioned here.

Chris: I tend to agree, but I'm not an implementer of this on TV devices and I don't have the context, so I'll take this back to DASH-IF.
.... If the actual cue type is always carried in the cue itself, does the track dispatch type make any sense?

Eric: I agree that it is not at all useful (and I disagreed with its inclusion in the first place, but lost that fight).
.... I think it makes sense to propose to remove it.

Chris: OK, I'll add an issue to track that.
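
As a sketch of Eric's filtering argument under the single-track model: metadataTrack, cue.type, and the type strings below are assumptions for illustration (the strings are not a registry), and handleDataCue is the sketch from the previous topic.

  // Sketch: with all timed metadata consolidated onto one track, an
  // application opts in by filtering on cue.type in a single listener.
  const SUBSCRIBED_TYPES = new Set([
    'org.id3',                  // illustrative: ID3 metadata
    'urn:mpeg:dash:event:2012'  // illustrative: a DASH event scheme
  ]);

  metadataTrack.addEventListener('cuechange', () => {
    for (let i = 0; i < metadataTrack.activeCues.length; i++) {
      const cue = metadataTrack.activeCues[i];
      if (SUBSCRIBED_TYPES.has(cue.type)) {
        handleDataCue(cue);
      }
    }
  });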

# Add TextTrackCue end time representing end of media

<cpn> https://github.com/whatwg/html/issues/5297 WHATWG Issue 5297

Chris: To recap, this is the idea that cues can be unbounded. They have a known start time, but we don't necessarily know when they are scheduled to end; it may be that we can only tell when they actually end.
.... At the moment, we don't have a direct way to express this situation. The proposal is to allow an unbounded end time value of Infinity. This also aligns with media.duration, which can be infinite too.

Eric: In the live HLS case for instance, duration is used this way.
.... I don't think there is any issue here. Silvia was disagreeing, but realized this alignment, and her last comment suggests she's fine with the update.

Chris: How to make progress?

Eric: I'd have somebody write a pull request

Chris: And get implementers' feedback?

Eric: I think it is going to be easier for people to share an opinion when there is a concrete proposal at hand.

Chris: OK, maybe Rob or I can draft something then.
.... My recollection of the feedback from the Media WG was that there was some concern about allowing cue times beyond the duration of a stream.

Eric: This already happens in WebKit. If it's a cue in a live stream, its end time is set to the infinite duration of the stream. I doubt Jer would object to this.

Nigel: Allowing it to be infinity makes a lot of sense. Do we need an algorithm for when the end time changes from infinity to a finite number?

Rob: I think this is a separate issue and that we should address it separately.

Eric: I agree.

Chris: Having reviewed the time marches on algorithm, I believe this is actually covered.

Nigel: The meaning of infinity needs defining with respect to the media. It may mean "never called" or "called when the media ends".

Rob: To be consistent with the current definition, infinity should mean the end of media.

Nigel: I think it's important that we specify this, for interop reasons.

Eric: Another angle, which may have already been specified: are you proposing that a cue may have an infinite endTime in a finite file as well?

Rob: Yes.
.... WebVMT provides a single example of this.
.... Time A, you are at location A, that won't change in the future.
.... If you imagine a capturing scenario, you create cues with unbounded end times.
.... But when you stop capturing, you now have a bounded stream, and if a finite end time is required, all your cues are invalid.

Eric: I don't quite follow. If you're recording and you open that file, the media stream has a finite duration.

Rob: Time A, cue runs to infinity. That's valid during capturing. But then when you stop, the infinity value is no longer valid.

Eric: My issue is that it is logically a problem.
.... For example, it is perfectly valid to have a file with audio and video tracks that have different amounts of media in them.
.... Different durations for the tracks for instance.
.... But the duration of the file is defined with respect to those tracks. You have to pick a duration.
.... I'll have to think about it.

Rob: I don't follow you. What would be the problem with having infinite cue endTime?
.... Hmm, I see, the end of the media for the video and audio would be different.

Eric: Right.
.... To be clear, if WebKit gets a cue in a file with an infinite duration, it sets endTime to Infinity. If the duration is finite, it sets endTime to the duration of the stream.

Nigel: Perhaps a test is that behavior should be the same regardless of whether the media is infinite or finite: when end time >= current media time, the end event gets fired.

Eric: The duration of the media stream is defined by the duration of the longest track. The file duration is 5 min if the audio is 1 min and the video is 5 min.
.... If you have a cue whose end time is infinite, that would imply that the duration of that text track is infinite, and potentially then the duration of the media stream is infinite, which is clearly not what you want in a file with finite media tracks.

Rob: Cue end time = Infinity would mean "at end of media", which is consistent with the 5-min video, 1-min audio definition

Eric: The duration of a media file is not defined in HTML. It is defined in individual specs for different media types.
.... Fine to leave that as open issue.

<Zakim> gkatsev, you wanted to mention MSE based players

Gary: MSE players will often have to update their duration based on extra information.
.... Ex: 10s segments but actual duration is 9.7s.
.... Being able to say: "I want this cue to trigger whenever playback finishes" is useful because it's possible that duration may end up being less than initially planned.
.... and therefore we would never trigger that cue.

Chris: Interesting. We need to follow up on definitions of duration.
.... OK, let's capture this in issues.
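
For illustration, a sketch of Rob's capture scenario under the proposed change; note that an Infinity end time is not accepted by the VTTCue constructor today (which is what issue #5297 would address), and the WebVMT-style payload and function names are assumptions.

  // Sketch: during capture, a cue's end is unknown, so under the
  // proposal it runs unbounded (endTime = Infinity).
  function startLocationCue(track, startTime, location) {
    // Hypothetical payload: WebVMT-style location data as cue text.
    const cue = new VTTCue(startTime, Infinity, JSON.stringify(location));
    track.addCue(cue);
    return cue;
  }

  // When capture stops, the stream becomes bounded, and unbounded cues
  // can be given the final media time as a concrete end time.
  function finalizeCapture(track, captureEndTime) {
    for (let i = 0; i < track.cues.length; i++) {
      const cue = track.cues[i];
      if (cue.endTime === Infinity) cue.endTime = captureEndTime;
    }
  }

This also fits Gary's MSE point: a cue meant to fire when playback finishes keeps working even if the eventual duration turns out shorter than planned.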

# Next call

Chris: 20th of July would be our next scheduled call, same time.

<kaz> [adjourned]

Summary of Action Items
Summary of Resolutions
[End of minutes]
Minutes manually created (not a transcript), formatted by David Booth's scribe.perl

Received on Monday, 15 June 2020 17:04:31 UTC