Re: WebVTT, Regions and live streams.

Hi, Silvia, 

Thanks for your replies.

The questions that I’m raising are not for VOD, but for live streams, where you do not know the end times of the cues, nor do you know ahead of time whether a clear is needed.

Responses inline.

> On Sep 29, 2015, at 9:53 PM, Silvia Pfeiffer <silviapfeiffer1@gmail.com> wrote:
> 
> Hi Bill,
> 
> replies inline.
> 
> On Fri, Sep 18, 2015 at 1:34 PM, May, Bill <bill.may@mlb.com> wrote:
>> Hello,
>> 
>> My name is Bill May; I’m an engineer at MLB Advanced Media looking into how to expand Closed-Captions into a world-wide solution.
>> 
>> We’re looking for a user experience similar to closed captions; the ability to have 1 word/letter at a time displayed, support for roll-up type caption for both VOD and live (event and linear) presentations.  We use HTTP Live Streaming (HLS) as our video protocol.
>> 
>> CEA-708 captions are one possible solution, but we believe that it will get harder as we add more languages.  We’d like to unbundle the captions from the other media.
>> 
>> So, it will come as little surprise that I’m thinking of webVTT.
>> 
>> With the later versions of the specification, WebVTT has done an excellent job of translating 608/708 and the properties required into webVTT using the region attributes, but only for a completed VOD type presentation.
>> 
>> Solutions like HLS and DASH use a duration based segmentation to provide (near) live streams.  When we need to provide webVTT cues for a live stream, the direction isn’t very clear.  Specifically:
>> 
>> 1). How to handle an EDM (clear screen).
>> 2). What to do at the end of a segment/beginning of the next when the closed caption line spans the segment.
>> 
>> case 1: (EDM)
>> When we use the region syntax, I have been assuming that each cue gets a start time, and an end time that encompasses the 16-second maximum display time that the closed caption specifications state.
>> 
>> That way, as the captions are added, the oldest ones will roll out of the region, even if they have time left.  If captions aren’t added, they will disappear when their end times are reached.  (Please correct me if my assumption of the region is wrong.)
>> 
>> However, that doesn’t let me enter a clear screen command.  There’s no way to change the end time of those earlier cues.
>> 
>> One possible solution for this is to add a bunch of short-lived cues with non-breaking spaces, but I do not believe that this is acceptable due to the background artifacts.
> 
> 
> EDM (clear screen) isn't really something that maps easily to a
> file-based format. It comes from a time where captions were
> command-based. The way to achieve a "clear screen" is to carefully
> author the cue durations such that at the time that the EDM is
> supposed to happen, that is the end time of the cues. It has to be
> pre-authored and can't be changed later. If you don't add any more
> captions to a region, the old ones will disappear and the newer ones
> will simply not scroll up, since they don't get pushed by new ones.
> When the last one ends, the region is cleared.

WebVTT is not only a file-based format; it is being used for live captioning.

When doing a live stream, one does not know the end time of a cue; you cannot preauthor the cues completely.

An attribute to clear the region would easily accomplish this.
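To make this concrete, here is a sketch of what such an attribute might look like.  The "clear" cue setting below is purely hypothetical (it is not part of the current WebVTT specification), and the region name is made up:

```
WEBVTT

REGION
id:rollup
lines:2
scroll:up

NOTE The "clear" setting below is hypothetical, not in the spec.
NOTE The intent is that it would terminate all active cues in the
NOTE region at 12.0s, regardless of their authored end times.

00:00:12.000 --> 00:00:12.000 region:rollup clear
```

The key property is that a captioner could emit this at any moment in the live stream, without going back and re-authoring the end times of cues that have already been published.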

> 
> 
> 
>> case 2 (cues that overlap segmentation)
>> case 2 (cues that overlap segmentation)
>> As for the segmentation, assume the following: a region with 2 lines, where we want to push out each word every 3/4 second (by using the timing mechanisms).  (I’ve left out any cue settings to make it a bit clearer.)
>> 
>> 00:00:00.000 --> 00:00:17.500
>> Caption <00:00:00.750>Line <00:00:01.500>1
>> 
>> 00:00:03.000 --> 00:00:22.000
>> Caption <00:00:03.750>Line <00:00:04.500>2 <00:00:05.250>Longer <00:00:06.000>extended
>> 
>> If we need to provide segmentation at 5-second intervals, the segmentation process does not yet have the words “Longer” and “extended”; they haven’t been entered by the captioner.
> 
> 
> Why do you need to provide segmentation at 5 second intervals?

That is the way that the protocols for live streaming work.

> 
> 
>> It also doesn’t know what the end time should be, as it should be 16 seconds from the end of the last word.
> 
> You can choose the end time to be whatever suits best. Is your cue now
> supposed to be a max of 5 seconds long or 16 seconds long?

Closed caption cues are supposed to last a maximum of 16 seconds, I believe.  Remember, at the time the cue starts, we do not know how long it will take.

> 
> 
>> Creating a new cue that starts for “Longer” and “extended” will cause the “Caption Line 2” part to scroll; we want it to continue on the same line.
> 
> 
> That's not the problem of the cue, but the problem that you want to
> segment at 5 second intervals.

Exactly.  We need a solution.

> 
> 
>> I’d like to know if there is a clear solution to these problems, and if not, if additions to the specification can be added to handle these cases.
> 
> It should all work with the adjustment of the end time as needed,
> unless I am missing something.

We cannot go back and reauthor the segments that have already been generated to change the end time, nor do we have a way to append to a cue.

To accomplish live streaming (as HLS does today), one uses a series of WebVTT files that together make up the stream.  The WebVTT is not a single file, but a series.  (Imagine a live linear stream with no beginning and no end.)  The segmentation (especially for live) occurs at regular intervals in order to stay in sync and present a live stream.  Cues can span segments.

We cannot wait until the cue has completed to add it to the stream; that would add too much latency and might violate the requirements of captioning.
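For illustration, here is roughly what a segmenter has to emit under the current rules, assuming 5-second segments and the example cues from above.  (The X-TIMESTAMP-MAP header is from the HLS WebVTT draft; the MPEGTS value here is made up.)  Each segment must carry, in full, every cue that overlaps it:

```
WEBVTT
X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:900000

00:00:00.000 --> 00:00:17.500
Caption <00:00:00.750>Line <00:00:01.500>1

00:00:03.000 --> 00:00:22.000
Caption <00:00:03.750>Line <00:00:04.500>2
```

The second segment (0:05 to 0:10) would then have to repeat both cues verbatim, including the words “Longer” and “extended” and the final end time; none of that is known when the first segment must be published at 0:05.  That is exactly the problem.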

I believe that for live streaming, we would need two things:
 - a way to specify EDM (clear the region)
 - a way to continue a cue across segments.
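As a straw man for the second item: today a cue identifier merely labels a cue, but one could imagine a marker that lets a later segment append text to a still-open cue with the same identifier from an earlier segment.  Nothing like this exists in the current specification; the syntax below is entirely hypothetical:

```
WEBVTT

NOTE "continuation" here is hypothetical syntax, not in the spec.
NOTE The intent is to append to the open cue "line2" from the
NOTE previous segment rather than start a new cue (which would
NOTE scroll the region).

line2 continuation
00:00:05.250 --> 00:00:22.000
<00:00:05.250>Longer <00:00:06.000>extended
```

With something along these lines, the segmenter could publish each segment on time and extend the cue (and its end time) as the captioner types.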

Thanks,
Bill May

Received on Wednesday, 30 September 2015 17:15:03 UTC