Re: [whatwg] How to expose caption tracks without TextTrackCues from Bob Lund on 2014-11-03 (public-whatwg-archive@w3.org from November 2014)

From: Bob Lund <B.Lund@CableLabs.com>
Date: Mon, 3 Nov 2014 23:54:03 +0000
To: Silvia Pfeiffer <silviapfeiffer1@gmail.com>, Brendan Long <self@brendanlong.com>
Cc: WHAT Working Group <whatwg@lists.whatwg.org>
Message-ID: <D07D5418.49175%b.lund@cablelabs.com>

On 11/3/14, 3:41 PM, "Silvia Pfeiffer" <silviapfeiffer1@gmail.com> wrote:

>On Tue, Nov 4, 2014 at 10:24 AM, Brendan Long <self@brendanlong.com>
>wrote:
>>
>> On 11/03/2014 04:20 PM, Silvia Pfeiffer wrote:
>>> On Tue, Nov 4, 2014 at 3:56 AM, Brendan Long <self@brendanlong.com>
>>> wrote:
>>> Right, that was the original concern. But how realistic is the
>>> situation of n video tracks and m caption tracks with n being larger
>>> than 2 or 3 without a change of the audio track anyway?
>> I think the situation gets confusing at N=2. See below.
>>
>>>> We would also need to consider:
>>>>
>>>>   * How do you label this combined video and text track?
>>> That's not specific to the approach that we pick and will always need
>>> to be decided. Note that label isn't something that needs to be unique
>>> to a track, so you could just use the same label for all burnt-in
>>> video tracks and identify them to be different only in the language.
>> But the video and the text track might both have their own label in the
>> underlying media file. Presumably we'd want to preserve both.
>>
>>>>   * What is the track's "id"?
>>> This would need to be unique, but I think it will be easy to come up
>>> with a scheme that works. Something like "video_[n]_[captiontrackid]"
>>> could work.
>> This sounds much more complicated and likely to cause problems for
>> JavaScript developers than just indicating that a text track has cues
>> that can't be represented in JavaScript.
>>
>>>>   * How do you present this to users in a way that isn't confusing?
>>> No different to presenting caption tracks.
>> I think VideoTracks with kind=caption are confusing too, and we should
>> avoid creating more situations where we need to do that.
>>
>> Even when we only have one video, it's confusing that captions could
>> exist in multiple places.
>>
>>>>   * What if the video track's kind isn't "main"? For example, what if
>>>>     we have a sign language track and we also want to display
>>>>     captions? What is the generated track's kind?
>>> How would that work? Are you saying we're not displaying the main
>>> video, but only displaying the sign language track? Is that realistic
>>> and something anybody would actually do?
>> It's possible, so the spec should handle it. Maybe it doesn't matter
>> though?
>>
>>>>   * The "language" attribute could also have conflicts.
>>> How so?
>> The underlying streams could have their own metadata, and it could
>> conflict. I'm not sure if it would ever be reasonable to author a file
>> like that, but it would be trivial to create. At the very least, we'd
>> need language to say which takes precedence if the two streams have
>> conflicting metadata.
>>
>>>>   * I think it might also be possible to create files where the video
>>>>     track and text track are different lengths, so we'd need to figure
>>>>     out what to do when one of them ends.
>>> The timeline of a video is well defined in the spec - I don't think we
>>> need to do more than what is already defined.
>> What I mean is that this could be confusing for users. Say I'm watching
>> a video with two video streams (main camera angle, secondary camera
>> angle) and two caption tracks (for sports, for example). If I'm watching
>> the secondary camera angle and looking at one of the caption tracks,
>> but then the secondary camera angle goes away, my player is now forced
>> to randomly select one of the caption tracks combined with the primary
>> video, because it's not obvious which one corresponds with the captions
>> I was reading before.
>>
>> In fact, if I was making a video player for my website where multiple
>> people give commentary on baseball games with multiple camera angles, I
>> would probably create my own controls that parse the video track ids and
>> separate them back into video and text tracks so that I could offer
>> separate video and text controls, since combining them just makes
>> the UI more complicated.
>
>That's what I meant with multiple video tracks: if you have several
>that require different captions, then you're in a world of hurt in any
>case and this has nothing to do with whether you're representing the
>non-cue-exposed caption tracks as UARendered or as a video track.
>
>
>> So, what's the advantage of combining video and captions, rather than
>> just indicating that a text track can't be represented as TextTrackCues?
>
>One important advantage: there's no need to change the spec.
>
>If we change the spec, we still have to work through all the issues
>that you listed above and find a solution.

Will we? I agree the case of multiple video tracks, each with different
audio/captions (possibly in multiple languages), is complicated. But
treating captions as burned-in video means the UA has to sort things out;
leaving them as cueless text tracks means the app figures it out. Having
the app sort it out doesn't make it easier, but it is more flexible.

Also, in the case where the text tracks have cues, the multiple
video/audio/text track case will have to be handled by the app. Why should
the whole model change just because cues are not exposed to JavaScript?
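
To make the comparison concrete, here is a rough sketch in TypeScript. It
is an illustration under stated assumptions, not anything from the spec:
SimpleTextTrack is a simplified stand-in for the DOM TextTrack, cuesExposed
is a hypothetical flag for a track whose cues the UA does not expose, and
splitCombinedId assumes the "video_[n]_[captiontrackid]" id scheme suggested
earlier in the thread.

```typescript
// Simplified stand-in for the DOM TextTrack (assumption for illustration).
// `cuesExposed` is a hypothetical flag, not an existing attribute.
interface SimpleTextTrack {
  id: string;
  kind: string;
  language: string;
  mode: "disabled" | "hidden" | "showing";
  cuesExposed: boolean;
}

// Cueless-text-track model: show the first captions track in the requested
// language and disable the other captions tracks. Non-caption tracks are
// left alone, and `cuesExposed` is never consulted, so the same app logic
// serves tracks with and without cues.
function selectCaptions(
  tracks: SimpleTextTrack[],
  language: string
): SimpleTextTrack | undefined {
  let chosen: SimpleTextTrack | undefined;
  for (const track of tracks) {
    if (track.kind !== "captions") continue;
    if (chosen === undefined && track.language === language) {
      track.mode = "showing";
      chosen = track;
    } else {
      track.mode = "disabled";
    }
  }
  return chosen;
}

// Combined-video-track model: assuming ids of the form
// "video_[n]_[captiontrackid]", an app that wants separate video and
// caption controls first has to split each id back apart.
function splitCombinedId(
  id: string
): { videoIndex: number; captionTrackId: string } | null {
  const match = /^video_(\d+)_(.+)$/.exec(id);
  if (match === null) return null;
  return { videoIndex: Number(match[1]), captionTrackId: match[2] };
}
```

In the first function the app does the same work whether or not cues are
exposed; the second is extra parsing that only the combined model requires.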

>
>Silvia.
Received on Monday, 3 November 2014 23:54:43 UTC