Re: Inband styling (was Re: Evidence of 'Wide Review' needed for VTT)

On 21/10/2015 14:54, "singer@apple.com on behalf of David Singer"
<singer@apple.com> wrote:

>
>> On Oct 21, 2015, at 15:37 , Nigel Megitt <nigel.megitt@bbc.co.uk> wrote:
>> 
>> On 21/10/2015 14:00, "singer@apple.com on behalf of David Singer"
>> <singer@apple.com> wrote:
>> 
>>> 
>>> No, it’s not hypothetical. DASH/MP4/VTT relies on this, and it was
>>> (and is) seen as a core advantage of VTT over TTML.
>> 
>> How curious. Live streaming with DASH/MP4/TTML works splendidly - there
>> were lots of implementations on show at IBC in September, of both coders
>> and presentation systems, based on EBU-TT-D, which is the profile of
>> TTML that is specified for HbbTV 2.0 and the DVB DASH profile. The
>> dash.js player is one. Samsung had a prototype television that was
>> decoding and presenting this format too - I'm pretty sure that there
>> are others in the works. The BBC has prototyped an implementation built
>> on gstreamer that works well also.
>> 
>> What advantage was identified with VTT in this scenario?
>
>Flexible granularity is one. Live streaming of TTML means short TTML
>documents, each of which describes a time interval. This means your
>segment size is basically that, or a multiple of it, and that is also
>then your minimum latency.

I don't think that segment size is equivalent to latency. If you mean that
segment duration sets a minimum latency, there are other things to
consider.

Typically in the DASH case there's an encoding and packaging layer that
accumulates the content to be streamed, segments it, encodes those
segments, packages them and then sends them to a distribution network. In
every case I've seen, the video encoding latency is much greater than the
subtitle/caption encoding latency, and the choice of segment duration is
based on what works well for video, without any impact on the latency for
audio or subtitles/captions.

For example, if the encoding pipeline introduces a 16-26s delay to encode
10s-long segments (i.e. the delay is 26s for the earliest frame in the
segment and 16s for the latest frame), then you need your live subtitle
encoding pipeline to be able to accumulate and encode subtitle/caption
segments in less time than that. Typically in the UK, live broadcast
subtitles are 6-10s later than the broadcast video. So even if you were
hypothetically (and illegally!) doing an off-air receive and encode in
this example, you'd still have to insert a delay in the live subtitles to
stop them appearing too early, assuming you choose a 10s segment size for
subtitles too - in practice you could choose an even longer segment size
if you wanted.
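
To make that arithmetic concrete, here's a rough Python sketch using the
same illustrative numbers (to be clear, these are made up, not
measurements from any real pipeline):

    # Illustrative numbers only, matching the example above.
    SEGMENT_DURATION = 10.0         # s, chosen to suit the video encoder
    VIDEO_ENCODE_DELAY = 16.0       # s after the latest frame in a segment
    SUBTITLE_AUTHORING_DELAY = 8.0  # s; live subtitles trail video by 6-10s

    # The subtitle encoder must deliver its segment no later than the
    # matching video segment, so its budget is whatever remains of the
    # video delay once the authoring delay has been spent.
    subtitle_budget = VIDEO_ENCODE_DELAY - SUBTITLE_AUTHORING_DELAY
    print(f"subtitle encoding budget: {subtitle_budget:.0f}s")   # 8s

    # If encoding the subtitle segment takes, say, 1s, the remainder has
    # to be inserted as a deliberate delay so the subtitles don't appear
    # too early relative to the video.
    SUBTITLE_ENCODE_TIME = 1.0      # s, assumed
    delay_to_insert = subtitle_budget - SUBTITLE_ENCODE_TIME
    print(f"delay to insert: {delay_to_insert:.0f}s")            # 7s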

I've just looked at our live BBC News HD service subtitles, which are
updated on every word, and the typical time between subtitle updates is
around 0.2 seconds. This is so much less than the sort of segment duration
I'd expect that there's no significant interaction between the two. In
fact there's no interaction at all: the worst-case scenario is that a
subtitle that briefly appears at the end of one segment and the beginning
of the next is duplicated in each segment. The visible appearance and
latency are not affected.
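
The segmentation logic for that worst case is trivial; here's a minimal
sketch (my own illustrative names and numbers, not any real encoder's
code):

    SEGMENT_DURATION = 10.0  # s

    def segments_for_cue(begin, end, seg_dur=SEGMENT_DURATION):
        # Indices of every segment a cue overlaps; a cue straddling a
        # boundary is simply emitted into both segments.
        first = int(begin // seg_dur)
        last = int(max(begin, end - 1e-9) // seg_dur)
        return list(range(first, last + 1))

    print(segments_for_cue(9.9, 10.1))  # [0, 1] - duplicated in each segment
    print(segments_for_cue(4.0, 4.2))   # [0]    - the typical 0.2s update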

I've used made-up but vaguely realistic numbers here, but in every case I
know of it takes longer to encode video than subtitles/captions, so this
argument scales down to lower latencies too.

The other side of this is: what happens if you don't need to worry about
video encoding? I have a prototype live TTML streaming system, which
transfers TTML documents over WebSocket, that I can show anyone who is
interested at TPAC. The lesson from that work is that, as long as you have
control over the network paths so that TCP doesn't trade delay for
reliability, this works faster than you can think, without imposing any
latency caused by the document format. The latency is all caused by the
network, and the data rates aren't so high that it really matters these
days. I wouldn't distribute subtitles to thousands of subscribers over the
internet using that mechanism, but as a contribution mechanism, e.g. to a
DASH encoder/packager in a closed environment, it would work very well. If
you don't like TCP then I'm sure that e.g. RTP would work just fine too,
with different trade-offs.
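
To give a feel for the shape of that contribution mechanism, here's a
minimal Python sketch (using the third-party 'websockets' package; this
is not the prototype's code, and the URI and the stand-in authoring
source are made up):

    import asyncio
    import websockets  # third-party package, assumed available

    async def author_documents():
        # Stand-in for a live subtitle authoring source: yields one
        # short TTML document per update, as soon as it exists.
        yield '<tt xmlns="http://www.w3.org/ns/ttml">...</tt>'

    async def push_ttml(uri):
        # One complete TTML document per WebSocket message; the format
        # itself imposes no latency here, only the network does.
        async with websockets.connect(uri) as ws:
            async for doc in author_documents():
                await ws.send(doc)

    asyncio.run(push_ttml("ws://packager.example/subtitles"))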

Nigel

>
>
>David Singer
>Manager, Software Standards, Apple Inc.
>

Received on Wednesday, 21 October 2015 14:31:36 UTC