Re: [whatwg] How to expose caption tracks without TextTrackCues

On 11/03/2014 04:20 PM, Silvia Pfeiffer wrote:
> On Tue, Nov 4, 2014 at 3:56 AM, Brendan Long <self@brendanlong.com> wrote:
> Right, that was the original concern. But how realistic is the
> situation of n video tracks and m caption tracks with n being larger
> than 2 or 3 without a change of the audio track anyway?
I think the situation gets confusing even at n=2. See below.

>> We would also need to consider:
>>
>>   * How do you label this combined video and text track?
> That's not specific to the approach that we pick and will always need
> to be decided. Note that label isn't something that needs to be unique
> to a track, so you could just use the same label for all burnt-in
> video tracks and identify them to be different only in the language.
But the video track and the text track might each have their own label
in the underlying media file. Presumably we'd want to preserve both.

>>   * What is the track's "id"?
> This would need to be unique, but I think it will be easy to come up
> with a scheme that works. Something like "video_[n]_[captiontrackid]"
> could work.
This sounds much more complicated, and more likely to cause problems for
JavaScript developers, than simply indicating that a text track has cues
that can't be represented in JavaScript.
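
To illustrate, here's roughly what a page would have to do to get the
caption track id back out of a combined track's id. A sketch only,
assuming the "video_[n]_[captiontrackid]" scheme above; the function is
mine, not from any spec:

    // Split a combined-track id like "video_2_cc1" back into its parts.
    // Fragile: breaks if an author's own ids happen to match the pattern.
    function splitCombinedTrackId(id) {
      var match = /^video_(\d+)_(.+)$/.exec(id);
      if (!match) {
        return null; // not a combined track
      }
      return {
        videoIndex: parseInt(match[1], 10),
        captionTrackId: match[2]
      };
    }

Compare that to checking a single attribute on the text track.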

>>   * How do you present this to users in a way that isn't confusing?
> No different to presenting caption tracks.
I think VideoTracks with kind=caption are confusing too, and we should
avoid creating more situations where we need them.

Even when there's only one video track, it's confusing that captions can
show up in more than one place.

>>   * What if the video track's kind isn't "main"? For example, what if we
>>     have a sign language track and we also want to display captions?
>>     What is the generated track's kind?
> How would that work? Are you saying we're not displaying the main
> video, but only displaying the sign language track? Is that realistic
> and something anybody would actually do?
It's possible, so the spec should handle it. Maybe it doesn't matter though?

>>   * The "language" attribute could also have conflicts.
> How so?
The underlying streams could each carry their own metadata, and that
metadata could conflict. I'm not sure it would ever be reasonable to
author a file like that, but it would be trivial to create. At the very
least, we'd need spec language saying which stream's value takes
precedence when they conflict.

>>   * I think it might also be possible to create files where the video
>>     track and text track are different lengths, so we'd need to figure
>>     out what to do when one of them ends.
> The timeline of a video is well defined in the spec - I don't think we
> need to do more than what is already defined.
What I mean is that this could be confusing for users. Say I'm watching
a video with two video streams (a main camera angle and a secondary
camera angle) and two caption tracks (different commentators on a sports
broadcast, for example). If I'm watching the secondary camera angle with
one of the caption tracks, and then the secondary camera angle goes away,
my player is forced to pick one of the primary video's combined tracks
arbitrarily, because nothing indicates which one corresponds to the
captions I was reading before.

In fact, if I were making a video player for my website where multiple
people give commentary on baseball games with multiple camera angles, I
would probably create my own controls that parse the video track ids and
separate them back into video and text tracks, so that I could offer
separate video and caption controls; combining them just makes the UI
more complicated.
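
Something like the sketch below, reusing the illustrative
splitCombinedTrackId() from above (still assuming the hypothetical id
scheme):

    // Rebuild separate video and caption lists from the combined
    // tracks, so custom controls can present them independently.
    function rebuildTrackLists(videoTracks) {
      var videoIndexes = {};
      var captionIds = {};
      for (var i = 0; i < videoTracks.length; i++) {
        var parsed = splitCombinedTrackId(videoTracks[i].id);
        if (parsed) {
          videoIndexes[parsed.videoIndex] = true;
          captionIds[parsed.captionTrackId] = true;
        }
      }
      return {
        videoIndexes: Object.keys(videoIndexes),
        captionIds: Object.keys(captionIds)
      };
    }

That's all busywork that only exists because the two track types were
merged in the first place.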


So, what's the advantage of combining video and captions, rather than
just indicating that a text track can't be represented as TextTrackCues?
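
The alternative could be as small as one flag. The attribute name here is
hypothetical, just to make the comparison concrete:

    // "hasUnrepresentableCues" is a made-up attribute, not in any spec.
    for (var i = 0; i < video.textTracks.length; i++) {
      var track = video.textTracks[i];
      if (track.hasUnrepresentableCues) {
        // List the track in the caption menu, but let the browser
        // render it rather than reading its cues from script.
      }
    }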

Received on Monday, 3 November 2014 23:25:27 UTC