Re: Proposal for initial CG focus

Hi Bob,

Thanks for starting the discussion. In general, I agree with what your 
draft proposes, apart from a few details. I have added replies to your 
comments inline below.

I also have a few organizational questions:
- Do we have a bug tracker as part of the CG process?
- Can we put the spec on a version-control server (GitHub, W3C, ...) so 
that people can propose modifications?
- Will you be at TPAC next week? Could we have a (short) informal 
meeting on the topic, maybe during the unconference on Wednesday?

On 31/10/2013 21:10, Bob Lund wrote:
> Hi,
>
> I'd like to start the discussion about the requirements and scope for the
> CG work.
>
> Currently, AFAIK there is no specification that fully defines how a UA
> should expose in-band tracks to Web applications.  Unspecified details are:
>
> * Which in-band tracks of a media resource the UA will expose. By expose I
> mean make available as VideoTrack, AudioTrack or TextTrack objects.
In my view, the browser should expose an audio or video track that it 
can decode natively as an AudioTrack/VideoTrack, and expose all other 
tracks as TextTracks, possibly with DataCues (base64-encoded, for 
instance). I do think there are cases where it makes sense to decode 
video/audio track content in JavaScript (e.g. decoding a depth video 
track). It should be up to the JS application developer to decide, 
based on the performance they get. We need to be careful, though, to 
expose the data efficiently (e.g. not loading it until it is requested, 
to control memory consumption, and maybe offering a mechanism to filter 
cue change events if the frame rate is too high).
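
To make this a bit more concrete, here is a rough sketch in TypeScript. 
All names (InbandTrackInfo, chooseExposure, makeCueFilter) are 
hypothetical and not part of any spec or browser API, and the codec 
strings are only placeholders:

```typescript
// Hypothetical description of an in-band track as seen by the UA's demuxer.
interface InbandTrackInfo {
  id: string;
  mediaType: "audio" | "video" | "other";
  codec: string; // placeholder syntax, an assumption for this sketch
}

type Exposure = "AudioTrack" | "VideoTrack" | "TextTrack";

// Natively decodable audio/video maps to AudioTrack/VideoTrack;
// everything else falls back to a (metadata) TextTrack.
function chooseExposure(
  track: InbandTrackInfo,
  canDecodeNatively: (codec: string) => boolean
): Exposure {
  if (track.mediaType === "audio" && canDecodeNatively(track.codec)) {
    return "AudioTrack";
  }
  if (track.mediaType === "video" && canDecodeNatively(track.codec)) {
    return "VideoTrack";
  }
  return "TextTrack";
}

// Returns a predicate that lets through at most one cue notification per
// minIntervalSec of media time, to limit the event rate seen by the page.
function makeCueFilter(
  minIntervalSec: number
): (cueStartTime: number) => boolean {
  let lastDelivered = -Infinity;
  return (cueStartTime: number): boolean => {
    if (cueStartTime - lastDelivered < minIntervalSec) {
      return false; // too close to the previously delivered cue, drop it
    }
    lastDelivered = cueStartTime;
    return true;
  };
}
```

The filter only illustrates the kind of mechanism I have in mind; whether 
it belongs in the UA or in the application is open.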

> * How metadata in the media resource about the in-band tracks is made
> available to Web applications. If the UA recognizes the in-band track,
> some of the metadata associated with the track will be made available by
> the "kind", "language" and "inbandMetadataTrackDispatchType"  attributes.
> This works fine when the UA fully recognizes the track and when the
> metadata maps completely to the predefined valid attribute values. But,
> this needn't be the case. For example, the UA may recognize an MPEG-2 TS
> audio track but not recognize metadata designating it as a Descriptive
> Video track.
I understand, and I think I agree. There are two separate questions:
- whether the UA can natively decode the track;
- whether the UA can map the descriptive data associated with the track 
to HTML5 descriptive data.

In the MP4 file format case, there is no easy mapping of audio/video 
track info to an HTML5 kind. At some point MPEG was considering adding 
HTML5 kind types. This has not progressed yet, though. The same goes 
for MPEG-2, I think.

>
> For more deterministic handling of inband tracks, the UA could be required
> to:
>
> * Expose all in-band tracks, in some form.
Yes
> * Make media resource metadata  associated with the inband tracks
> available to Web applications. [1] does this through a TextTrack.
> * Provide data so that the Web application can correlate the metadata data
> with the appropriate Track object.
I'm not sure I follow you here. I guess this has to do with the notion 
of "Track Description TextTrack" in [2].

This idea is interesting and I see where it comes from. The number of 
streams in an MPEG-2 TS program can vary over time, and stream changes 
are signaled with a PMT. So indeed, you could consider a PMTTextTrack 
to signal those changes. If I understand correctly, the "Track 
Description TextTrack" is a generalization of that PMTTextTrack. For 
MP4, this would correspond to a 'moov'-TextTrack, but in MP4 the moov 
is static for the whole duration of the file and the number of tracks 
does not change. At the moment, I can't think of any file-wide, 
time-varying metadata that is not represented in the MP4 file format as 
a track. So for MP4 that "Track Description TextTrack" would have only 
one cue.

My initial idea (I'm not sure it covers every case in MPEG-2) was 
rather to expose the time-varying information related to a track in the 
track itself. For instance, for MPEG-2 TS, if a PMT updates the 
descriptor associated with an elementary stream for which a TextTrack 
has already been created, I would create a new cue for that new 
information. So the cues of every stream would carry two types of 
information: signaling and/or real data.
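
As an illustration of this per-track approach, here is a minimal sketch 
in TypeScript. It assumes a hypothetical demuxer that calls appendCue 
whenever it sees a sample or an updated descriptor for a given PID; the 
types (CuePayload, SketchCue, SketchTrack) are made up for the example 
and are not an existing API:

```typescript
// A cue carries either signaling (e.g. an updated PMT descriptor for the
// stream) or real data (a sample of the stream).
type CuePayload =
  | { type: "signaling"; descriptor: Uint8Array }
  | { type: "data"; sample: Uint8Array };

interface SketchCue {
  startTime: number; // seconds on the media timeline
  payload: CuePayload;
}

interface SketchTrack {
  pid: number; // PID of the elementary stream behind this track
  cues: SketchCue[];
}

// Append a cue to the track for this PID, creating the track on first use.
// A descriptor update therefore lands in the same track as the stream data.
function appendCue(
  tracks: Map<number, SketchTrack>,
  pid: number,
  startTime: number,
  payload: CuePayload
): SketchTrack {
  let track = tracks.get(pid);
  if (track === undefined) {
    track = { pid, cues: [] };
    tracks.set(pid, track);
  }
  track.cues.push({ startTime, payload });
  return track;
}
```

With this shape, the application can tell the two kinds of cue apart by 
looking at payload.type.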

>
> [1] is a specification written by CableLabs about how a UA would meet
> these requirements in the case of MPEG-2 transport stream media resources.
> This spec was written before some of the recent additions to HTML5, e.g.
> inbandTrackDispatchType. The users of [1] would like to see a W3C spec
> that addresses the same problem, takes into account the current HTML5.1
> work and addresses some details that [1] missed, e.g. more precise
> definition of what constitutes a metadata TextTrack, how in-band data gets
> segmented into TextTrackCues, how Cue start and end times are derived from
> the in-band track.
>
> [2] is an informal draft describing how the technique in [1] can be
> applied to WebM, Ogg and MPEG4 media resources. This CG could address
> these media container formats as well.
>
> So, at a minimum, I propose:
>
> * That the CG start by discussing the requirements outlined above, with
> the goal of creating a spec at least for MPEG-2 TS (recognizing that some
> of this may already be covered by HTML5.1).
I agree.
> * WebM and MP4 are widely supported in browsers so it might make sense to
> cover these formats.
As you know, I started working on the mapping of MP4 tracks onto HTML 
Tracks in [3]. My idea was to use a WebVTT-based syntax as an 
intermediate solution, until the CG spec is out, to experiment. I think 
it's very close to what you have proposed.
>
> If there is sufficient interest, we could take on other inband track
> formats as well.
Indeed, we could think of RTP-based streams in a future version, but 
there is enough on the plate for now.
>
> Comments?
Some additional comments on [2]:
- To identify the type of the track data, I don't think we should rely 
on the label. For instance, in [3], I use the MP4 track handler as the 
value for the label. I think inbandMetadataTrackDispatchType could be 
used instead. But this raises the question of which syntax to use. I'm 
not sure a MIME type is appropriate, or maybe with 'codecs' and 
'profile' parameters. In the MP4 work [3], I've used a mix: a MIME type 
when the actual cue payload conforms to that MIME type (e.g. when an 
MP4 track contains SVG data), or else the MPEG-4 "stream type" and 
"object type indication". We could also expose the MPEG-2 TS "stream 
type", maybe in the "codecs" parameter of the MPEG-2 MIME type.
- I agree that the original track ID should be exposed. We could 
recommend setting Track.id to the Media Fragment Identifier 
representing that track in the original file. In [3], I've used 
'trackId' but that's not generic enough.
- I think there is a problem with setting endTime to Infinity, as the 
WebIDL type for that attribute is not "unrestricted double". We could 
recommend setting the endTime to the startTime of the next cue, but 
that introduces latency...
- I think the algorithm to expose MP4 tracks as "text" TextTrack or 
"base64-text" TextTrack can be improved to account for the new MPEG-4 
Part 30 tracks for subtitles (e.g. WebVTT, TTML and others). I can 
fix that if you want.
- I think we need to discuss DASH/HLS and others separately, taking MSE 
into consideration. I'm not sure we need a mapping for those.
- I think also that we need to be consistent with what happens with 
MediaStreams, although I've not followed it closely...
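
On the endTime point above, here is a minimal sketch of the "end at the 
start of the next cue" option, assuming endTime has to stay finite. The 
names (PendingCue, TimedCue, makeCueCloser) are hypothetical. It also 
shows where the latency comes from: a cue can only be emitted once its 
successor (or the end of the media) is known.

```typescript
interface PendingCue {
  startTime: number; // seconds
  data: string;
}

interface TimedCue extends PendingCue {
  endTime: number; // always finite in this sketch
}

// Holds back the most recent cue until the next one arrives, then emits it
// with endTime set to the startTime of that next cue.
function makeCueCloser(emit: (cue: TimedCue) => void) {
  let pending: PendingCue | null = null;
  return {
    push(next: PendingCue): void {
      if (pending !== null) {
        emit({ ...pending, endTime: next.startTime });
      }
      pending = next;
    },
    // At end of stream, close the last cue at the media duration.
    flush(mediaDuration: number): void {
      if (pending !== null) {
        emit({ ...pending, endTime: mediaDuration });
      }
      pending = null;
    },
  };
}
```

Closing the last cue at the media duration is just one possible choice 
for the sketch; it does not help for live streams.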

>
> [1] http://www.cablelabs.com/specifications/CL-SP-HTML5-MAP-I02-120510.pdf
> [2] http://html5.cablelabs.com/tracks/media-container-mapping.html
[3] http://concolato.wp.mines-telecom.fr/2013/10/24/using-webvtt-to-carry-media-streams/


Regards,
Cyril


-- 
Cyril Concolato
Maître de Conférences/Associate Professor
Groupe Multimedia/Multimedia Group
Telecom ParisTech
46 rue Barrault
75 013 Paris, France
http://concolato.wp.mines-telecom.fr/

Received on Thursday, 7 November 2013 17:11:37 UTC