Re: [whatwg] How to expose caption tracks without TextTrackCues from Brendan Long on 2014-11-03 (public-whatwg-archive@w3.org from November 2014)

From: Brendan Long <self@brendanlong.com>
Date: Mon, 03 Nov 2014 17:50:28 -0600
To: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Cc: WHAT Working Group <whatwg@lists.whatwg.org>
Message-ID: <545814C4.7000503@brendanlong.com>

On 11/03/2014 05:41 PM, Silvia Pfeiffer wrote:
> On Tue, Nov 4, 2014 at 10:24 AM, Brendan Long <self@brendanlong.com> wrote:
>> On 11/03/2014 04:20 PM, Silvia Pfeiffer wrote:
>>> On Tue, Nov 4, 2014 at 3:56 AM, Brendan Long <self@brendanlong.com> wrote:
>>> Right, that was the original concern. But how realistic is the
>>> situation of n video tracks and m caption tracks with n being larger
>>> than 2 or 3 without a change of the audio track anyway?
>> I think the situation gets confusing at N=2. See below.
>>
>>>> We would also need to consider:
>>>>
>>>>   * How do you label this combined video and text track?
>>> That's not specific to the approach that we pick and will always need
>>> to be decided. Note that label isn't something that needs to be unique
>>> to a track, so you could just use the same label for all burnt-in
>>> video tracks and identify them to be different only in the language.
>> But the video and the text track might both have their own label in the
>> underlying media file. Presumably we'd want to preserve both.
>>
>>>>   * What is the track's "id"?
>>> This would need to be unique, but I think it will be easy to come up
>>> with a scheme that works. Something like "video_[n]_[captiontrackid]"
>>> could work.
>> This sounds much more complicated and likely to cause problems for
>> JavaScript developers than just indicating that a text track has cues
>> that can't be represented in JavaScript.
>>
>>>>   * How do you present this to users in a way that isn't confusing?
>>> No different to presenting caption tracks.
>> I think VideoTracks with kind=captions are confusing too, and we should
>> avoid creating more situations where we need to do that.
>>
>> Even when we only have one video, it's confusing that captions could
>> exist in multiple places.
>>
>>>>   * What if the video track's kind isn't "main"? For example, what if we
>>>>     have a sign language track and we also want to display captions?
>>>>     What is the generated track's kind?
>>> How would that work? Are you saying we're not displaying the main
>>> video, but only displaying the sign language track? Is that realistic
>>> and something anybody would actually do?
>> It's possible, so the spec should handle it. Maybe it doesn't matter though?
>>
>>>>   * The "language" attribute could also have conflicts.
>>> How so?
>> The underlying streams could have their own metadata, and it could
>> conflict. I'm not sure if it would ever be reasonable to author a file
>> like that, but it would be trivial to create. At the very least, we'd
>> need spec language to say which takes precedence if the two streams have
>> conflicting metadata.
>>
>>>>   * I think it might also be possible to create files where the video
>>>>     track and text track are different lengths, so we'd need to figure
>>>>     out what to do when one of them ends.
>>> The timeline of a video is well defined in the spec - I don't think we
>>> need to do more than what is already defined.
>> What I mean is that this could be confusing for users. Say I'm watching
>> a video with two video streams (main camera angle, secondary camera
>> angle) and two caption tracks (for sports for example). If I'm watching
>> the secondary camera angle and looking at one of the caption tracks,
>> but then the secondary camera angle goes away, my player is now forced
>> to randomly select one of the caption tracks combined with the primary
>> video, because it's not obvious which one corresponds with the captions
>> I was reading before.
>>
>> In fact, if I was making a video player for my website where multiple
>> people give commentary on baseball games with multiple camera angles, I
>> would probably create my own controls that parse the video track ids and
>> separate them back into video and text tracks so that I could offer
>> separate video and text controls, since combining them just makes
>> the UI more complicated.
> That's what I meant with multiple video tracks: if you have several
> that require different captions, then you're in a world of hurt in any
> case and this has nothing to do with whether you're representing the
> non-cue-exposed caption tracks as UARendered or as a video track.
I mean multiple video tracks that are valid for multiple caption tracks.
The example I had in my head was sports commentary, with multiple people
commenting on the same game, which is available from multiple camera angles.

We probably do need a way to indicate which tracks go together when not
all of them do, though. I think it's come up before. Maybe the
obvious answer is, "don't have tracks that don't go together in the same
file".
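
To make the "parse the video track ids" workaround concrete, here is a
minimal sketch of what such player code might look like. It assumes the
"video_[n]_[captiontrackid]" id scheme floated earlier in the thread; that
scheme is only a suggestion, not anything in the spec, and the names
parseCombinedId and separateTracks are hypothetical.

```typescript
// Hypothetical id scheme from the thread: "video_[n]_[captiontrackid]".
interface CombinedTrackId {
  videoIndex: number;     // the [n] part
  captionTrackId: string; // the [captiontrackid] part (may itself contain "_")
}

// Split one combined id into its parts, or return null if it does not match.
function parseCombinedId(id: string): CombinedTrackId | null {
  const prefix = "video_";
  if (id.indexOf(prefix) !== 0) return null;
  const rest = id.slice(prefix.length);
  const sep = rest.indexOf("_");
  // Need a non-empty index before the separator and a non-empty id after it.
  if (sep <= 0 || sep === rest.length - 1) return null;
  const indexPart = rest.slice(0, sep);
  if (!/^\d+$/.test(indexPart)) return null;
  return { videoIndex: Number(indexPart), captionTrackId: rest.slice(sep + 1) };
}

interface SeparatedTracks {
  videoIndexes: number[];
  captionTrackIds: string[];
}

// Undo the combination: recover the distinct videos and caption tracks
// so that a player could offer separate video and text controls.
function separateTracks(ids: string[]): SeparatedTracks {
  const videoIndexes: number[] = [];
  const captionTrackIds: string[] = [];
  for (const id of ids) {
    const parsed = parseCombinedId(id);
    if (parsed === null) continue; // not a combined track, skip it
    if (videoIndexes.indexOf(parsed.videoIndex) === -1) {
      videoIndexes.push(parsed.videoIndex);
    }
    if (captionTrackIds.indexOf(parsed.captionTrackId) === -1) {
      captionTrackIds.push(parsed.captionTrackId);
    }
  }
  return { videoIndexes, captionTrackIds };
}
```

With two camera angles and two commentary tracks, the UA would expose four
combined video tracks, and the page would have to run something like
separateTracks over their ids just to get back to "2 videos, 2 captions".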

>> So, what's the advantage of combining video and captions, rather than
>> just indicating that a text track can't be represented as TextTrackCues?
> One important advantage: there's no need to change the spec.
>
> If we change the spec, we still have to work through all the issues
> that you listed above and find a solution.
>
> Silvia.
I suppose not changing the spec is nice, but I think the changes are
simpler if we have no-cue text tracks, since the answer to all of my
questions becomes "we don't do that, we just keep the two tracks separate".
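
As a rough illustration of the "keep the two tracks separate" answer, the
sketch below models video and text selection independently. Everything in
it is hypothetical: VideoTrackInfo, TextTrackInfo, the cuesExposed flag and
TrackSelection are plain stand-ins invented for this example, not the real
VideoTrack/TextTrack interfaces or any proposed API.

```typescript
// Hypothetical plain-data models, NOT the real HTML track interfaces.
interface VideoTrackInfo {
  id: string;
  label: string;
  language: string;
}

interface TextTrackInfo {
  id: string;
  label: string;
  language: string;
  // Hypothetical flag: false means the UA renders the captions itself
  // and no TextTrackCues are available to script.
  cuesExposed: boolean;
}

class TrackSelection {
  private videos: VideoTrackInfo[];
  private texts: TextTrackInfo[];
  private videoId: string | null = null;
  private textId: string | null = null;

  constructor(videos: VideoTrackInfo[], texts: TextTrackInfo[]) {
    this.videos = videos.slice();
    this.texts = texts.slice();
  }

  selectVideo(id: string): void {
    if (this.videos.some((v) => v.id === id)) this.videoId = id;
  }

  selectText(id: string): void {
    if (this.texts.some((t) => t.id === id)) this.textId = id;
  }

  // A video track ends or goes away: fall back to the first remaining
  // video. The text selection is deliberately left untouched.
  removeVideo(id: string): void {
    this.videos = this.videos.filter((v) => v.id !== id);
    if (this.videoId === id) {
      const next = this.videos[0];
      this.videoId = next ? next.id : null;
    }
  }

  current(): { videoId: string | null; textId: string | null } {
    return { videoId: this.videoId, textId: this.textId };
  }
}
```

In the camera-angle example, removing the secondary angle only changes the
selected video; the viewer keeps the commentary captions they were reading,
and each track keeps its own id, label and language.
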
Received on Monday, 3 November 2014 23:50:55 UTC