Re: Proposal for initial CG focus from Bob Lund on 2013-11-07 (public-inbandtracks@w3.org from November 2013)

From: Bob Lund <B.Lund@CableLabs.com>
Date: Thu, 7 Nov 2013 22:01:13 +0000
To: Giuseppe Pascale <giuseppep@opera.com>, Cyril Concolato <cyril.concolato@telecom-paristech.fr>
CC: "public-inbandtracks@w3.org" <public-inbandtracks@w3.org>
Message-ID: <CEA159EB.3694B%b.lund@cablelabs.com>
From: Giuseppe Pascale <giuseppep@opera.com>
Date: Thursday, November 7, 2013 2:51 PM
To: Cyril Concolato <cyril.concolato@telecom-paristech.fr>
Cc: "public-inbandtracks@w3.org" <public-inbandtracks@w3.org>
Subject: Re: Proposal for initial CG focus
Resent-From: <public-inbandtracks@w3.org>
Resent-Date: Thursday, November 7, 2013 2:52 PM

Maybe we should start by creating a wiki table with the formats we are considering (MPEG-2 TS, MP4, etc.) and the various AudioTrack, VideoTrack and TextTrack attributes, showing how they are supposed to be set by the UA?

I think this is a good idea. The wiki should have a requirements section that applies to all formats, and a per-format section considering how the requirements are implemented for that format. I believe the CG has access to W3C resources for such a wiki, but I don't know how to make use of them. Does anyone know?


This will make it easy to see if we are missing something. After the table is complete, it should be easy to write the equivalent spec text.

Note that HTML5 itself already provides some (informative) "hints":
http://www.w3.org/html/wg/drafts/html/master/embedded-content-0.html#dom-audiotrack-kind

/g


On Thu, Nov 7, 2013 at 6:11 PM, Cyril Concolato <cyril.concolato@telecom-paristech.fr> wrote:
Hi Bob,

Thanks for starting the discussion, too. In general, I'm very much in line with what your draft proposes, apart from some details. I have added some answers to your comments below.

I also have some organizational points:
- Do we have a bug tracker as part of the CG process?
- Can we put the spec on a version-control server (GitHub, W3C, ...) so that people can propose modifications?
- Will you be at TPAC next week? Could we have a (short) informal meeting on the topic, maybe during the unconference on Wednesday?

On 31/10/2013 21:10, Bob Lund wrote:

Hi,

I'd like to start the discussion about the requirements and scope for the
CG work.

Currently, AFAIK there is no specification that fully defines how a UA
should expose in-band tracks to Web applications.  Unspecified details are:

* Which in-band tracks of a media resource the UA will expose. By "expose" I
mean making them available as VideoTrack, AudioTrack or TextTrack objects.
In my view, the browser should expose an audio/video track that it knows it can natively decode as an AudioTrack/VideoTrack, and expose all other tracks as TextTracks, possibly with DataCues (base64-encoded, for instance). I do think there are cases where it makes sense to decode video/audio track content in JavaScript (e.g. decoding a depth video track). It should be up to the JS application developer to decide, based on the performance they get. We need to be careful, though, to expose the data efficiently (e.g. not loading it until it is requested, to control memory consumption, and maybe offering a mechanism to filter cue change events if the frame rate is too high).
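To make the proposed rule concrete, here is a minimal TypeScript sketch of the decision, under stated assumptions: `InbandTrackInfo`, `ExposedTrack`, `exposeTrack` and the `canDecode` callback are hypothetical names, not part of HTML5 or any browser API, and the payload is simply base64-encoded into the cue text rather than wrapped in a real DataCue.

```typescript
// Hypothetical description of an in-band track as found by a demuxer.
interface InbandTrackInfo {
  id: string;
  mediaType: "audio" | "video" | "other";
  codec: string;       // codec identifier string (format is an assumption)
  payload: Uint8Array; // raw sample data, e.g. one access unit
}

// Hypothetical result: which kind of HTML track object the UA would create.
type ExposedTrack =
  | { kind: "AudioTrack" | "VideoTrack"; id: string }
  | { kind: "TextTrack"; id: string; cueText: string };

// Base64-encode bytes without relying on browser-only btoa() or Node's Buffer.
function toBase64(bytes: Uint8Array): string {
  const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const b0 = bytes[i];
    const b1 = i + 1 < bytes.length ? bytes[i + 1] : 0;
    const b2 = i + 2 < bytes.length ? bytes[i + 2] : 0;
    const n = (b0 << 16) | (b1 << 8) | b2;
    out += alphabet[(n >> 18) & 63] + alphabet[(n >> 12) & 63];
    out += i + 1 < bytes.length ? alphabet[(n >> 6) & 63] : "=";
    out += i + 2 < bytes.length ? alphabet[n & 63] : "=";
  }
  return out;
}

// Natively decodable audio/video becomes an AudioTrack/VideoTrack; everything
// else becomes a TextTrack whose cue text carries the base64-encoded payload.
function exposeTrack(
  track: InbandTrackInfo,
  canDecode: (codec: string) => boolean,
): ExposedTrack {
  if (track.mediaType === "audio" && canDecode(track.codec)) {
    return { kind: "AudioTrack", id: track.id };
  }
  if (track.mediaType === "video" && canDecode(track.codec)) {
    return { kind: "VideoTrack", id: track.id };
  }
  return { kind: "TextTrack", id: track.id, cueText: toBase64(track.payload) };
}
```

For example, a depth video track whose codec `canDecode` rejects comes back as a TextTrack with its bytes `[1, 2, 3]` carried as the cue text `"AQID"`, while an audio track the UA can decode is exposed as an AudioTrack. Lazy loading and cue-change filtering, mentioned above, are not modeled here.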


* How metadata in the media resource about the in-band tracks is made
available to Web applications. If the UA recognizes the in-band track,
some of the metadata associated with the track will be made available by
the "kind", "language" and "inBandMetadataTrackDispatchType" attributes.
This works fine when the UA fully recognizes the track and when the
metadata maps completely to the predefined valid attribute values. But
this need not be the case. For example, the UA may recognize an MPEG-2 TS
audio track but not recognize metadata designating it as a Descriptive
Video track.
I understand, and I think I agree. We have to distinguish between:
- whether or not the UA natively supports decoding the track; and
- whether the UA can map the descriptive data associated with the track to HTML5 descriptive data.
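A minimal sketch of the second distinction, for the MPEG-2 TS descriptive audio example: the `audio_type` field of the ISO_639_language_descriptor is mapped to an HTML5 AudioTrack kind only where an equivalent is obvious, and the raw metadata is kept so the application can still use it otherwise. Treating `audio_type` 0x03 (visual impaired commentary) as "description" is an assumption for illustration, and `Mpeg2AudioInfo`, `AudioTrackAttributes`, `rawMetadata` and `mapAudioTrack` are hypothetical names.

```typescript
// audio_type value from the MPEG-2 ISO_639_language_descriptor (ISO/IEC 13818-1).
const VISUAL_IMPAIRED_COMMENTARY = 0x03;

// Hypothetical summary of what a demuxer learned about one audio stream.
interface Mpeg2AudioInfo {
  pid: number;
  streamType: number; // stream_type from the PMT elementary stream loop
  language?: string;  // ISO 639 code, if the descriptor is present
  audioType?: number; // audio_type, if the descriptor is present
}

// Hypothetical attribute set the UA would expose for the AudioTrack.
interface AudioTrackAttributes {
  id: string;
  kind: string;                // "" when no HTML5 kind can be derived
  language: string;
  rawMetadata: Mpeg2AudioInfo; // hypothetical: source metadata kept for the app
}

function mapAudioTrack(info: Mpeg2AudioInfo): AudioTrackAttributes {
  // Only the value with an obvious HTML5 equivalent is mapped (an assumption).
  const kind = info.audioType === VISUAL_IMPAIRED_COMMENTARY ? "description" : "";
  return {
    id: String(info.pid),
    kind,
    language: info.language ?? "",
    rawMetadata: info,
  };
}
```

With this mapping, a hearing-impaired audio stream (`audio_type` 0x02) is still exposed, but with an empty kind; only `rawMetadata` tells the application what it is, which is the gap described above.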

In the case of the MP4 file format (FF), there is no easy mapping of audio/video track info to an HTML5 kind. At one point, MPEG was considering adding HTML5 kind types, but this has not progressed yet. The same holds for MPEG-2, I think.



For more deterministic handling of inband tracks, the UA could be required
to:

* Expose all in-band tracks, in some form.
Yes

* Make media resource metadata associated with the in-band tracks
available to Web applications. [1] does this through a TextTrack.
* Provide data so that the Web application can correlate the metadata
with the appropriate Track object.
I'm not sure I follow you here. I guess this has to do with the notion of "Track Description TextTrack" in [2].

This idea is interesting, and I see where it comes from. An MPEG-2 TS program can have a time-varying number of streams, and stream changes are signaled with a PMT. So indeed, you could consider a PMTTextTrack to signal those changes. If I understand correctly, the "Track Description TextTrack" is a generalization of that PMTTextTrack. For MP4, this would correspond to a 'moov' TextTrack, but in MP4 the moov is static for the whole duration of the file and the number of tracks does not change. At the moment, I can't think of any file-wide, time-varying metadata that is not represented as a track in the MP4 file format. So that "Track Description TextTrack" would have only one cue.

My initial idea (I'm not sure it covers every case in MPEG-2) was rather to expose the time-varying information related to a track in the track itself. For instance, for MPEG-2 TS, if a PMT updates the descriptor associated with an elementary stream for which a TextTrack has already been created, I would create a new cue carrying that new information. So the cue data for every stream would carry two types of information: signaling and/or actual data.
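The per-track approach could look roughly like the following sketch. It assumes hypothetical names (`SketchCue`, `CuePayload`, `InbandTextTrackSketch`) and a simplified string serialization of the PMT descriptors. Because a PMT is retransmitted periodically, the sketch adds a signaling cue only when the descriptors for the stream actually change; a data cue is added for each payload.

```typescript
// Hypothetical cue payload: either signaling (descriptor) info or stream data.
type CuePayload =
  | { type: "signaling"; descriptors: string } // serialized PMT descriptors (assumed format)
  | { type: "data"; bytes: Uint8Array };       // actual elementary-stream data

interface SketchCue {
  startTime: number; // seconds
  endTime: number;   // seconds
  payload: CuePayload;
}

class InbandTextTrackSketch {
  readonly cues: SketchCue[] = [];
  private lastDescriptors: string | undefined = undefined;

  // Called whenever a PMT is parsed. Repeated, unchanged PMTs add nothing;
  // a change yields a new signaling cue in this same track. The endTime is
  // left equal to startTime here for simplicity.
  onPmt(time: number, descriptors: string): void {
    if (descriptors === this.lastDescriptors) return;
    this.lastDescriptors = descriptors;
    this.cues.push({ startTime: time, endTime: time, payload: { type: "signaling", descriptors } });
  }

  // Called for each payload of the elementary stream.
  onData(startTime: number, endTime: number, bytes: Uint8Array): void {
    this.cues.push({ startTime, endTime, payload: { type: "data", bytes } });
  }
}
```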



[1] is a specification written by CableLabs about how a UA would meet
these requirements in the case of MPEG-2 transport stream media resources.
This spec was written before some of the recent additions to HTML5, e.g.
inBandMetadataTrackDispatchType. The users of [1] would like to see a W3C
spec that addresses the same problem, takes into account the current HTML5.1
work and addresses some details that [1] missed, e.g. a more precise
definition of what constitutes a metadata TextTrack, how in-band data gets
segmented into TextTrackCues, and how cue start and end times are derived
from the in-band track.

[2] is an informal draft describing how the technique in [1] can be
applied to WebM, Ogg and MPEG-4 media resources. This CG could address
these media container formats as well.

So, at a minimum, I propose:

* That the CG start by discussing the requirements outlined above, with
the goal of creating a spec at least for MPEG-2 TS (recognizing that some
of this may already be covered by HTML5/HTML5.1).
I agree.

* WebM and MP4 are widely supported in browsers so it might make sense to
cover these formats.
As you know, I started working on the mapping of MP4 tracks onto HTML tracks in [3]. My idea was to use a WebVTT-based syntax as an interim solution for experimenting until the CG spec is out. I think it's very close to what you have proposed.


If there is sufficient interest, we could take on other inband track
formats as well.
Indeed, we could consider RTP-based streams in a future version, but there is enough on our plate for now.

Comments?
Some additional comments on [2]:
- To identify the type of the track data, I don't think we should rely on the label. For instance, in [3], I use the MP4 track handler as the value of the label. I think inBandMetadataTrackDispatchType could be used instead, but this raises the question of which syntax to use. I'm not sure a MIME type is appropriate, except perhaps with 'codecs' and 'profile' parameters. In the MP4 work [3], I've used a mix: the MIME type when the actual cue payload conforms to that MIME type (e.g. when an MP4 track contains SVG data), or else the MPEG-4 "stream type" and "object type indication". We could also expose the MPEG-2 TS "stream type", perhaps in the "codecs" parameter of the MPEG-2 MIME type.
- I agree that the original track ID should be exposed; we could recommend setting Track.id to the Media Fragment Identifier representing that track in the original file. In [3], I've used 'trackId', but that's not generic enough.
- I think there is a problem with setting endTime to Infinity, as the WebIDL type of that attribute is not "unrestricted double". We could recommend setting endTime to the startTime of the next cue, but that introduces latency...
- I think the algorithm for exposing MP4 tracks as "text" TextTracks or "base64-text" TextTracks can be improved to account for the new MPEG-4 Part 30 tracks for subtitles (e.g. WebVTT, TTML and others). I can fix that if you want.
- I think we need to discuss DASH, HLS and others separately, taking MSE into consideration. I'm not sure we need a mapping for those.
- I also think we need to be consistent with what happens with MediaStreams, although I haven't followed that work closely...
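A minimal sketch of the "endTime = startTime of the next cue" option, which also shows where the latency comes from: a cue can only be emitted once the following sample has arrived. `CueCloser`, `TimedSample` and `FinalCue` are hypothetical names, and closing the last cue at a finite end-of-media time (rather than Infinity) is an assumption.

```typescript
interface TimedSample {
  startTime: number; // presentation time in seconds
  data: string;
}

interface FinalCue {
  startTime: number;
  endTime: number;
  data: string;
}

// Incrementally closes cues: each sample's cue is emitted only when the next
// sample arrives, because its endTime is that next sample's startTime.
class CueCloser {
  private pending: TimedSample | undefined = undefined;

  // Returns the cue completed by this sample, or undefined for the first one.
  push(sample: TimedSample): FinalCue | undefined {
    const prev = this.pending;
    this.pending = sample;
    if (prev === undefined) return undefined;
    return { startTime: prev.startTime, endTime: sample.startTime, data: prev.data };
  }

  // At end of stream, close the last cue with a finite time (e.g. the media
  // duration), never earlier than the cue's own startTime.
  flush(endOfMedia: number): FinalCue | undefined {
    const prev = this.pending;
    this.pending = undefined;
    if (prev === undefined) return undefined;
    return { startTime: prev.startTime, endTime: Math.max(endOfMedia, prev.startTime), data: prev.data };
  }
}
```

With samples at 0 s and 2.5 s, the first cue (0 s to 2.5 s) only becomes available when the 2.5 s sample is parsed; the sparser the in-band data, the larger that delay.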



[1] http://www.cablelabs.com/specifications/CL-SP-HTML5-MAP-I02-120510.pdf
[2] http://html5.cablelabs.com/tracks/media-container-mapping.html
[3] http://concolato.wp.mines-telecom.fr/2013/10/24/using-webvtt-to-carry-media-streams/

Regards,
Cyril


--
Cyril Concolato
Maître de Conférences/Associate Professor
Groupe Multimedia/Multimedia Group
Telecom ParisTech
46 rue Barrault
75 013 Paris, France
http://concolato.wp.mines-telecom.fr/
Received on Thursday, 7 November 2013 22:02:20 UTC