Re: [whatwg] How to expose caption tracks without TextTrackCues

On Tue, Nov 4, 2014 at 10:24 AM, Brendan Long <self@brendanlong.com> wrote:
>
> On 11/03/2014 04:20 PM, Silvia Pfeiffer wrote:
>> On Tue, Nov 4, 2014 at 3:56 AM, Brendan Long <self@brendanlong.com> wrote:
>> Right, that was the original concern. But how realistic is the
>> situation of n video tracks and m caption tracks, with n larger than
>> 2 or 3, without a change of the audio track anyway?
> I think the situation gets confusing at n=2. See below.
>
>>> We would also need to consider:
>>>
>>>   * How do you label this combined video and text track?
>> That's not specific to the approach we pick and will always need
>> to be decided. Note that a label doesn't need to be unique to a
>> track, so you could just use the same label for all burnt-in video
>> tracks and distinguish them only by language.
> But the video and the text track might both have their own label in the
> underlying media file. Presumably we'd want to preserve both.
>
>>>   * What is the track's "id"?
>> This would need to be unique, but I think it will be easy to come up
>> with a scheme that works. Something like "video_[n]_[captiontrackid]"
>> could work.
> This sounds much more complicated and likely to cause problems for
> JavaScript developers than just indicating that a text track has cues
> that can't be represented in JavaScript.
>
>>>   * How do you present this to users in a way that isn't confusing?
>> No different to presenting caption tracks.
> I think VideoTracks with kind=captions are confusing too, and we should
> avoid creating more situations where we need to do that.
>
> Even when we only have one video, it's confusing that captions could
> exist in multiple places.
>
>>>   * What if the video track's kind isn't "main"? For example, what if we
>>>     have a sign language track and we also want to display captions?
>>>     What is the generated track's kind?
>> How would that work? Are you saying we're not displaying the main
>> video, but only displaying the sign language track? Is that realistic
>> and something anybody would actually do?
> It's possible, so the spec should handle it. Maybe it doesn't matter though?
>
>>>   * The "language" attribute could also have conflicts.
>> How so?
> The underlying streams could have their own metadata, and it could
> conflict. I'm not sure if it would ever be reasonable to author a file
> like that, but it would be trivial to create. At the very least, the
> spec would need to say which takes precedence if the two streams have
> conflicting metadata.
>
>>>   * I think it might also be possible to create files where the video
>>>     track and text track are different lengths, so we'd need to figure
>>>     out what to do when one of them ends.
>> The timeline of a video is well defined in the spec - I don't think we
>> need to do more than what is already defined.
> What I mean is that this could be confusing for users. Say I'm watching
> a video with two video streams (main camera angle, secondary camera
> angle) and two caption tracks (for a sports broadcast, for example). If
> I'm watching the secondary camera angle and reading one of the caption
> tracks, but then the secondary camera angle goes away, my player is now
> forced to randomly select one of the caption tracks combined with the
> primary video, because it's not obvious which one corresponds to the
> captions I was reading before.
>
> In fact, if I were making a video player for my website where multiple
> people give commentary on baseball games with multiple camera angles, I
> would probably create my own controls that parse the video track ids
> and separate them back into video and text tracks, so that I could
> offer separate video and text controls, since combining them just makes
> the UI more complicated.

That's what I meant about multiple video tracks: if you have several
that require different captions, then you're in a world of hurt in any
case, and that is true regardless of whether the non-cue-exposed
caption tracks are represented as UARendered or as video tracks.


> So, what's the advantage of combining video and captions, rather than
> just indicating that a text track can't be represented as TextTrackCues?

One important advantage: there's no need to change the spec.

If we change the spec, we would still have to work through all the
issues you listed above and find solutions for them.
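
For what it's worth, under the existing spec a script can already find
and switch on such a combined track through the VideoTrackList API,
along these lines (again only a rough sketch; the language check is
purely illustrative):

    var video = document.querySelector('video');
    for (var i = 0; i < video.videoTracks.length; i++) {
      var track = video.videoTracks[i];
      // kind "captions" marks a rendition with the captions burnt in
      if (track.kind === 'captions' && track.language === 'en') {
        track.selected = true; // selecting one deselects the others
        break;
      }
    }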

Silvia.

Received on Monday, 3 November 2014 23:41:45 UTC