Re: [media] Moving forward with captions / subtitles

On Sun, 14 Feb 2010 17:22:23 +0800, Silvia Pfeiffer  
<silviapfeiffer1@gmail.com> wrote:

> Hi Philip,
>
> On Sun, Feb 14, 2010 at 6:45 PM, Philip Jägenstedt <philipj@opera.com>  
> wrote:
>> On Sun, 14 Feb 2010 00:42:50 +0800, Eric Carlson  
>> <eric.carlson@apple.com>
>> wrote:
>>
>>>
>>> On Feb 13, 2010, at 7:01 AM, Philip Jägenstedt wrote:
>>>
>>>> On Sat, 13 Feb 2010 21:04:36 +0800, Silvia Pfeiffer
>>>> <silviapfeiffer1@gmail.com> wrote:
>>>>
>>>>> However, I must say I really like the idea of making it independent  
>>>>> of
>>>>> "text", i.e. leaving the possibility open to add "tracks" of audio or
>>>>> video in future.
>>>>>
>>>>> I'd be happy for something that essentially means "external parallel
>>>>> track".
>>>>
>>>> Considering how many different names we have already come up with, I
>>>> doubt <track> will be the last :) Brainstorm away!
>>>>
>>>  To me the singular "track" implies that only one <source> will be  
>>> chosen
>>> which will not be true in all cases. Maybe <tracks>?
>>>
>>>
>>>>>> role="" is fine, but I'd like to see more ideas on what UAs should
>>>>>> do with it.
>>>>>
>>>>> The thought is to use it not just for captions, subtitles, and  
>>>>> textual
>>>>> audio descriptions, but also for karaoke, lyrics, chapters, timed
>>>>> comments, timed metadata, and other such time-aligned text and
>>>>> annotations. There are examples with lyrics
>>>>> (http://svg-wow.org/audio/animated-lyrics.html, and
>>>>> http://annodex.net/~silvia/itext/chocolate_rain.html), and chapters
>>>>> (http://annodex.net/~silvia/itext/elephant_no_skin_v2.html). I'm sure
>>>>> we will come up with more similar examples.
>>>>
>>>> Yes, but is it expected that the UA should do something with the
>>>> attribute, like make context menus based on it? Or should it be part  
>>>> of the
>>>> track selection algorithm? (Where "track selection algorithm" does  
>>>> not exist
>>>> yet, but is what will select which tracks are enabled by default  
>>>> based on...
>>>> language and such?)
>>>>
>>>  I think the selection of alternates is an important point. Some media
>>> container formats (e.g. QuickTime and MPEG-4) allow an author to mark
>>> tracks
>>> as being part of an "alternate group". This instructs the media engine
>>> to
>>> enable only one track in the group based on a condition on the user's
>>> machine when the file is opened for playback. For example, a movie can  
>>> have
>>> subtitle tracks and chapter tracks in multiple languages, but only one  
>>> of
>>> each is rendered when the movie plays.
>>>
>>>  We need to support this use case with external "tracks", and we need  
>>> to
>>> define the selection algorithm when a file has both internal and  
>>> external
>>> tracks.
>>>
>>>  We also need to define a mechanism to mark tracks as being part of an
>>> alternate group. Is an attribute on <source> enough?
>>>
>>>    <tracks>
>>>        <source type="text/srt" src="en-captions.srt" lang="en"
>>> role="caption">
>>>        <source type="text/srt" src="zh-captions.srt" lang="zh"
>>> role="caption">
>>>       <source type="text/srt" src="en-chapters.srt" lang="en"
>>> role="chapters">
>>>        <source type="text/srt" src="zh-chapters.srt" lang="zh"
>>> role="chapters">
>>>    </tracks>
>>> Or should we have a grouping element like Silvia had in her early
>>> proposal?
>>>
>>>    <tracks>
>>>        <track role="caption">
>>>            <source type="text/srt" src="en-captions.srt" lang="en">
>>>            <source type="text/srt" src="zh-captions.srt" lang="zh">
>>>        </track>
>>>       <track role="chapters">
>>>            <source type="text/srt" src="en-chapters.srt" lang="en">
>>>            <source type="text/srt" src="zh-chapters.srt" lang="zh">
>>>        </track>
>>>    </tracks>
>>>
>>>  I hesitate to define yet another element, but I think the markup in a
>>> complex case like Silvia's Elephants Dream sample,
>>> http://annodex.net/~silvia/itext/elephant_no_skin_v2.html, is clearer
>>> because of it. On the other hand, will complex cases like this be  
>>> common
>>> enough that we need it?
>>
>> How to group/nest <tracks>, <track>, <source>, type="", lang="" and  
>> role=""
>> is a bit of a headache...
>>
>> For <video> and <audio>, <source>s are mutually exclusive and should
>> represent exactly the same resource with only technical differences like
>> codec, bitrate or resolution. It's not clear that we at all need  
>> <source> to
>> switch between text formats as a common format here is very likely to be
>> found without much controversy. We might want it if we seriously expect  
>> more
>> than one format to be used though.
>
> I wouldn't think the format is the issue here - @type is just a
> description of what format to expect, not as a selection mechanism.
> Just like in an img element there is also support for several file
> formats, but there is no means to mark up selections between different
> ones. Even if we support more than one format, that shouldn't become a
> selection criterion.

If type has no influence over which track is selected, I suggest we not have
the attribute at all.

>> Do we at all want to support the case of enabling multiple text tracks  
>> in a
>> declarative way or via browser context menus? In my opinion this is a  
>> bit
>> overkill (I've never used a media player that supports it) and we might
>> delegate this to scripts. If others agree, we don't need any grouping
>> element for the purpose of making a group of tracks mutually exclusive  
>> --
>> all tracks are mutually exclusive.
>
> Did you mean all source elements within a track of a certain role are
> mutually exclusive? If so, I agree.

No, I mean that all tracks of any language/role/whatever are mutually  
exclusive. I've never seen a media player that allows enabling multiple  
text tracks simultaneously and wouldn't want to figure out a UI for doing  
it in the context menu or with native controls.
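
To make "all tracks are mutually exclusive" concrete, here is a rough sketch in Python. The Track class and enable_track() are made up for illustration and are not part of any proposal or spec; the file names are borrowed from Eric's example. Enabling one track simply disables every other one, regardless of role or language:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Track:
    # Mirrors the attributes discussed in this thread; the class is hypothetical.
    src: str
    role: str
    lang: str
    enabled: bool = False


def enable_track(tracks: List[Track], chosen: Optional[Track]) -> None:
    """Enable `chosen` and disable every other track, whatever its role or language.

    Passing None turns all text tracks off.
    """
    for track in tracks:
        track.enabled = track is chosen


tracks = [
    Track("en-captions.srt", "caption", "en"),
    Track("zh-captions.srt", "caption", "zh"),
    Track("en-chapters.srt", "chapters", "en"),
]

enable_track(tracks, tracks[0])
enable_track(tracks, tracks[2])  # switching to chapters turns the captions off
enabled = [t.src for t in tracks if t.enabled]
```

With this model a context menu is just a flat radio list plus an "off" entry, so no grouping element is needed to express exclusivity.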

>> The other kind of grouping is per role (subtitles/captions/karaoke),
>> language and... something else? If this is given by role="" and  
>> lang="", how
>> should a context menu be constructed? Group by role or by language?
>
> To me - just from a menu construction POV - it seemed to make more
> sense to group by role and within role as alternatives.
>
> The idea of grouping per language seems to me to make things more  
> complicated.

No grouping at all makes it even simpler :)

>> Would it in fact be sufficient to just use a flat list of <track type=""
>> role="" src="" lang=""> ?
>
> No, that makes it harder to read and to see which <source>s are
> alternatives to each other.
>
>> Or should we let the grouping be in any order and let role="" and  
>> lang="" be
>> inherited? E.g.
>>
>> <video>
>>  <tracks lang="en">
>>    <track role="caption">
>>    <track role="subtitles">
>>  </tracks>
>>  <tracks lang="sv">
>>    <track role="caption">
>>  </tracks>
>> </video>
>>
>> vs
>>
>> <video>
>>  <tracks role="caption">
>>    <track lang="en">
>>    <track lang="sv">
>>  </tracks>
>>  <tracks role="subtitles">
>>    <track lang="en">
>>  </tracks>
>> </video>
>>
>> The difference is mainly in how context-menus are presented.
>>
>> Realistically though, few videos will have more than a few different  
>> tracks,
>> and complicating the markup to enable properly nested context menus  
>> really
>> might not be worth it.
>>
>> Just thinking out loud here...
>
>
> To me it's a semantic issue, too. If we group by language, does that
> mean I can only select one such group to be displayed and all the
> others are alternatives? That would mean I have to take the English
> captions with the English subtitles and the English audio description.
> I don't think that makes much sense. Rather, I would e.g. choose the
> English caption and the German subtitles.
>
> The 'alternate group' specification in MPEG (and probably QuickTime)
> does the following:
>
> “alternate_group”: is an integer that specifies a group or collection
> of tracks. If this field is 0 there is no information on possible
> relations to other tracks. If this field is not 0, it should be the
> same for tracks that contain alternate data for one another and
> different for tracks belonging to different such groups. Only one
> track within an alternate group should be played or streamed at any
> one time, and must be distinguishable from other tracks in the group
> via attributes such as bitrate, codec, language, packet size etc. A
> group may have only one member.
>
> I think with the tracks that are grouped by "role" we get exactly this
> behaviour. The distinction between them is based on lang, type and
> media query, so fits very well with this description of "alternate
> group".

No matter how we group it the tracks will be the same and the information  
about the tracks available from the markup will be the same. I don't think  
that grouping should have any influence on track selection. Grouping is  
only relevant if we want to:

* enable multiple text tracks simultaneously, either declaratively (in
markup) or via native browser UI

* provide some nesting in context menus

I don't think either of these is important enough to introduce new
elements or attributes.
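
As a rough illustration of why the grouping carries no extra information, here is a small Python sketch. The dictionaries are my own encoding of the two nested <tracks> examples quoted above (grouped by lang vs. grouped by role) and the function names are hypothetical. Flattening either form gives the same set of (role, lang) pairs:

```python
from typing import Dict, List, Set, Tuple

# Each dict maps the attribute on the outer <tracks> element to the values
# of the attribute on its inner <track> elements (hypothetical encoding).
by_lang: Dict[str, List[str]] = {
    "en": ["caption", "subtitles"],
    "sv": ["caption"],
}
by_role: Dict[str, List[str]] = {
    "caption": ["en", "sv"],
    "subtitles": ["en"],
}


def flatten_by_lang(groups: Dict[str, List[str]]) -> Set[Tuple[str, str]]:
    """Return the (role, lang) pairs for markup grouped by language."""
    return {(role, lang) for lang, roles in groups.items() for role in roles}


def flatten_by_role(groups: Dict[str, List[str]]) -> Set[Tuple[str, str]]:
    """Return the (role, lang) pairs for markup grouped by role."""
    return {(role, lang) for role, langs in groups.items() for lang in langs}


# Both groupings describe exactly the same three tracks.
same_tracks = flatten_by_lang(by_lang) == flatten_by_role(by_role)
```

Any selection algorithm that looks only at role and lang would therefore see identical input in both cases; the nesting only affects how a menu might be laid out.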

Am I missing something?

-- 
Philip Jägenstedt
Core Developer
Opera Software

Received on Sunday, 14 February 2010 11:24:42 UTC