Re: Updating sourcing in-band text track for MP4 files

Hi Silvia,

On 16/09/2013 05:26, Silvia Pfeiffer wrote:
> On Thu, Sep 12, 2013 at 1:50 AM, Cyril Concolato
> <cyril.concolato@telecom-paristech.fr>  wrote:
>> Hi all,
>>
>> The current HTML5 spec [1][2] explains how to build text tracks from ISO
>> tracks, but only for the case where the ISO track is a timed metadata track
>> (metx, mett). First, this does not cover all tracks which can be potentially
>> useful in a web page (e.g. 3GPP Timed Text).
> Are you expecting browsers to implement native 3GPP Timed Text support?
I do not expect that to happen.
> If so, a TextTrackCue sub-interface should be defined.
> If not, since it's captions, it would make sense to define a mapping
> to WebVTT cue content & cue settings to be able to expose them in
> existing interfaces.
Could be interesting, yes.
> At minimum, it should be exposed as @kind=metadata with 3GPP Timed
> Text content exposed in .text of whatever we decide to make the
> generic interface for such cues (right now, it's TextTrackCue, but we
> have the proposed UnparsedCue interface in preparation).
Agreed, so that JS libraries could be used to render them.
>
>> Also, with the recent MPEG work
>> on the carriage of Timed Text for TTML and WebVTT [3], I think the HTML spec
>> should be updated (or maybe that text moved to the ISO specification). To my
>> knowledge, it is not implemented yet by browsers.
> I'd be happy for some of that to move to the ISO specification, in
> particular if you want to map all the ISO tracks. However, some
> description of what should happen needs to be included in the HTML
> spec. Let's work on what that should be.
>
>
>> In the light of the recent and long (!!) discussions on Text Tracks, I would
>> like to propose the following:
>> - When possible (as indicated by Eric [5], this is not always possible), all
>> ISO tracks, except when the handler type is 'vide', 'auxv', 'soun' or
>> 'hint', should be exposed as TextTracks (i.e. this covers the 'meta' tracks
>> but now also 'subt' (used for TTML) or 'text' (used for WebVTT) tracks, and
>> other tracks, see the register at [4])
> Can you go through all of these and make a list of the types under
> question and where they fit into one of the semantic @kind values that
> the HTML spec has? The list at http://mp4ra.org/codecs.html seems huge
> and does not cover all the types you're mentioning.
It's not so big once you remove the Audio/Video/Hint handler types. The
remaining stream types would be:
- ISO stuff: Text timed metadata, XML timed metadata, URI identified 
metadata, MPEG-4 Systems streams, SVC metadata, text streams
- DVB stuff: Track Level Index Track, Movie Level Index Track
- 3GPP/OMA: 3GPP Timed Text, OMA Keys
- DECE Sub-titles (Timed Text),
- Apple 32/64 bit timecode samples
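
As a rough illustration of the filtering rule from the quoted proposal, here
is a minimal Python sketch. The names `EXCLUDED_HANDLER_TYPES` and
`is_text_track_candidate` are hypothetical, and the sketch ignores the "when
possible" caveat: it only encodes the four excluded handler types.

```python
# Handler types that the proposal excludes from TextTrack exposure.
EXCLUDED_HANDLER_TYPES = {"vide", "auxv", "soun", "hint"}


def is_text_track_candidate(handler_type: str) -> bool:
    """Return True if a track with this handler type would be a TextTrack candidate."""
    return handler_type not in EXCLUDED_HANDLER_TYPES


# Print the decision for a few handler types mentioned in this thread.
for handler in ("vide", "soun", "meta", "subt", "text"):
    print(handler, is_text_track_candidate(handler))
```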

> Also, a nit-pick: I am confused why WebVTT is regarded as "Textual
> meta-data with MIME type" when it's just generally time-aligned bits
> of data?
The ISO spec is quite confusing here, and maybe the MP4RA site too.
There are 2 parameters to consider:
- the *handler type* (3rd column in the MP4RA site) that classifies the 
content in large categories, to inform the player about the broad 
capabilities it needs to have to process the stream, and which can have 
the following 4CC values (i.e. ability to process) : 'soun' (sound), 
'vide' (video), 'subt' (subtitles potentially with images), 'text' 
(subtitles without images), 'hint' (transport protocol packets) or 
'meta' (metadata).
- and the stream type (or *sample entry type*, 1st column) also 
identified by a 4CC.
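
To make the first parameter concrete: the handler type is carried in the
'hdlr' box of each track. Below is a minimal sketch of reading it, assuming
the usual ISO base media file format layout for that box's payload
(version/flags, a 32-bit pre_defined field, the 4CC handler type, 12 reserved
bytes, then a null-terminated name). The function name and the sample bytes
are made up for illustration.

```python
import struct


def parse_hdlr_payload(payload: bytes) -> tuple[str, str]:
    """Return (handler_type, name) from the payload of an 'hdlr' box.

    `payload` is the box content after the 8-byte size/type header.
    """
    if len(payload) < 24:
        raise ValueError("hdlr payload too short")
    # 4 bytes version+flags, 4 bytes pre_defined, 4 bytes handler type.
    _version_flags, _pre_defined, handler = struct.unpack(">II4s", payload[:12])
    # Skip the 12 reserved bytes; the rest is a null-terminated UTF-8 name.
    name = payload[24:].split(b"\x00", 1)[0].decode("utf-8", errors="replace")
    return handler.decode("ascii", errors="replace"), name


# Hypothetical payload for a track with handler type 'text'.
sample = bytes(8) + b"text" + bytes(12) + b"Subtitles\x00"
print(parse_hdlr_payload(sample))
```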

Unfortunately, there is some overlap in the handler types between
'subt', 'meta' and 'text'. I proposed harmonizing them but lost that
battle. So here are some examples of interest (using <handler
type>/<stream type>/<additional parameters when the stream type is too 
generic>):
- WebVTT is identified as 'text'/'wvtt'
- TTML is identified as 'subt'/'stpp'
- 3GPP Timed Text is identified as 'text'/'tx3g'
- a generic XML metadata stream would be: 'meta'/'metx'/<namespace>
- a generic text metadata stream would be: 'meta'/'mett'/<mime format>

As for the one you mention, "Textual meta-data with MIME type", it is
identified as 'meta'/'text'/<mime format>, and I can't find what it is
used for...
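
The identification scheme above can be sketched as a simple lookup keyed on
the (handler type, sample entry type) pair. This is only an illustration: the
table name and function name are hypothetical, and the table holds just the
five examples listed above ('meta'/'text' is deliberately left out, since its
use is unclear).

```python
# (handler type, sample entry type) -> format, for the examples listed above.
KNOWN_FORMATS = {
    ("text", "wvtt"): "WebVTT",
    ("subt", "stpp"): "TTML",
    ("text", "tx3g"): "3GPP Timed Text",
    ("meta", "metx"): "XML metadata",
    ("meta", "mett"): "text metadata",
}


def identify_format(handler_type, sample_entry_type, extra=None):
    """Name the stream format; `extra` is the namespace or MIME format, if any."""
    name = KNOWN_FORMATS.get((handler_type, sample_entry_type), "unknown")
    # The generic entries are only meaningful with their additional parameter.
    if sample_entry_type in ("metx", "mett") and extra:
        return f"{name} ({extra})"
    return name


print(identify_format("text", "wvtt"))
print(identify_format("meta", "metx", "urn:example:ns"))
```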

>> - then, if the ISO-parser/browser pair is capable of producing an
>> equivalent WebVTT representation of the text track content (of any @kind,
>> possibly metadata) without losing information, the
>> @inBandMetadataTrackDispatchType is left empty and the track is populated as
>> if it was an out-of-band WebVTT track. This would be used for example when
>> WebVTT content is carried in ISO tracks but could be used for other formats
>> where the mapping to WebVTT is feasible/simple. Note we could add a similar
>> text for TTML once the TTML cues are defined.
> Note the above mentioned distinction between the currently proposed
> UnparsedCue and VTTCue - this should be taken care of here, too.
>
> So, first you need to check if the format in cues is natively
> supported in the browser and use that TextTrackCue sub-interface for
> the cues.
> (e.g. if TTMLCue is supported in the browser, expose it as TTMLCue)
>
> Only if it's not supported and it's not semantically @kind=metadata,
> suggest converting it to WebVTT.
Agreed.
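
To check that we read the order of precedence the same way, here is a minimal
sketch of the decision as I understand it. The function and parameter names
are hypothetical; "TTMLCue" is only the example from the quoted text, and
"UnparsedCue" is the proposed interface, not something that exists today.

```python
def choose_cue_interface(cue_format, kind, native_interfaces, lossless_webvtt):
    """Pick the cue interface for an in-band track.

    cue_format: format name, e.g. "WebVTT" or "TTML".
    kind: semantic kind of the track, e.g. "captions" or "metadata".
    native_interfaces: maps natively supported formats to their cue interface.
    lossless_webvtt: True if the content can be converted to WebVTT without loss.
    """
    # 1. A natively supported format uses its own TextTrackCue sub-interface.
    if cue_format in native_interfaces:
        return native_interfaces[cue_format]
    # 2. Otherwise, non-metadata content is converted to WebVTT when possible.
    if kind != "metadata" and lossless_webvtt:
        return "VTTCue"
    # 3. Everything else falls back to the generic (proposed) interface.
    return "UnparsedCue"


native = {"WebVTT": "VTTCue"}  # assumed browser capabilities
print(choose_cue_interface("WebVTT", "subtitles", native, True))
print(choose_cue_interface("3GPP Timed Text", "captions", native, True))
print(choose_cue_interface("TTML", "metadata", native, False))
```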

>
>
>> - and otherwise (if a WebVTT representation cannot be generated, or cannot
>> be generated without loss),
>>    - the TextTrack object is populated as follows:
>>       - the @kind is set to 'metadata'
>>       - the @label is set to the ISO 'track handler name'
>>       - the @id is set to the ISO track id
>>       - the @inBandMetadataTrackDispatchType contains the base64 encoded
>> sample entry box.
>>    - and each sample produces a cue built as follows:
>>       - the id attribute is empty
>>       - the pauseOnExit attribute is set to false
>>       - the start and end time of the cue are the start and end time of the
>> sample.
>>       - the content of the cue contains the sample data. Note: the cue
>> content can be in .text (base64 encoded if initially binary) or if the cue
>> interface (TextTrackCue, VTTCue or UnparsedCue or whatever the name)
>> includes an ArrayBuffer, we should use that.
> That makes sense to me with UnparsedCue as the interface.
Ok, I'll make sure this is integrated when the interface finally shows up.
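
In the meantime, here is a minimal sketch of the fallback population described
in the quoted proposal. The class and function names are hypothetical, and for
simplicity it assumes every sample is binary and therefore base64 encodes it
into .text; the ArrayBuffer variant is not shown.

```python
import base64
from dataclasses import dataclass, field


@dataclass
class FallbackCue:
    start_time: float
    end_time: float
    text: str                    # base64 of the sample data
    id: str = ""                 # the proposal leaves the cue id empty
    pause_on_exit: bool = False


@dataclass
class FallbackTextTrack:
    id: str                                      # ISO track id
    label: str                                   # ISO track handler name
    in_band_metadata_track_dispatch_type: str    # base64 sample entry box
    kind: str = "metadata"
    cues: list = field(default_factory=list)


def build_fallback_track(track_id, handler_name, sample_entry_box, samples):
    """Build the metadata track; `samples` yields (start, end, data bytes)."""
    track = FallbackTextTrack(
        id=str(track_id),
        label=handler_name,
        in_band_metadata_track_dispatch_type=base64.b64encode(
            sample_entry_box
        ).decode("ascii"),
    )
    for start, end, data in samples:
        encoded = base64.b64encode(data).decode("ascii")
        track.cues.append(FallbackCue(start, end, encoded))
    return track


# Made-up track id, handler name, sample entry bytes and sample.
track = build_fallback_track(3, "Example handler", b"\x00\x01", [(0.0, 2.0, b"hi")])
print(track.kind, track.id, track.label, len(track.cues))
```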

Cyril
>
> Cheers,
> Silvia.
>
>
>> Comments?
>>
>> Cyril
>>
>> [1]
>> http://www.w3.org/html/wg/drafts/html/master/embedded-content-0.html#sourcing-in-band-text-tracks
>> [2]
>> http://www.w3.org/html/wg/drafts/html/master/embedded-content-0.html#guidelines-for-exposing-cues-in-various-formats-as-text-track-cues
>> [3]
>> http://www.w3.org/community/texttracks/2013/09/11/carriage-of-webvtt-and-ttml-in-mp4-files/
>> [4] http://mp4ra.org/codecs.html
>> [5] http://lists.w3.org/Archives/Public/public-html/2013Sep/0012.html
>>
>> --
>> Cyril Concolato
>> Maître de Conférences/Associate Professor
>> Groupe Multimedia/Multimedia Group
>> Telecom ParisTech
>> 46 rue Barrault
>> 75 013 Paris, France
>> http://concolato.wp.mines-telecom.fr/
>>
>>



Received on Tuesday, 17 September 2013 14:21:59 UTC