[TTML-WEBVTT] How to map multiple <p> that share same time range to WebVTT from Boy van Dijk on 2021-11-03 (public-tt@w3.org from November 2021)

From: Boy van Dijk <boy@unified-streaming.com>
Date: Wed, 3 Nov 2021 14:37:26 -0700
To: public-tt@w3.org
Message-ID: <CADxNcn+RFgnJ=gpesGtvp=N9_fGy3JFb+5iZ5XW-sv8DsEvNxA@mail.gmail.com>

Hi,

I represent Unified Streaming and would like your expertise on how to map
the following bit of TTML to WebVTT:

<div begin="00:00:46.320" end="00:00:48.360">
    <p style="singleHeightStyle" tts:textAlign="center" region="region-20">
        <span xml:space="preserve">&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;</span>
        <span xml:space="preserve"
tts:backgroundColor="black">&#xA0;You&#xA0;guys&#xA0;wanna&#xA0;story?&#xA0;</span>
        <span xml:space="preserve">&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;</span>
    </p>
    <p style="singleHeightStyle" tts:display="inlineBlock"
xml:space="preserve" region="region-20">&#xA0;</p>
    <p style="singleHeightStyle" tts:textAlign="center" region="region-20">
        <span xml:space="preserve">&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;</span>
        <span xml:space="preserve"
tts:backgroundColor="black">&#xA0;(MEN&#xA0;CHEERING)&#xA0;</span>
        <span xml:space="preserve">&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;</span>
    </p>
</div>


Or, my problem in simplified form:

<p begin="00:00:46.320" end="00:00:48.360">
You guys wanna story?
</p>
<p begin="00:00:46.320" end="00:00:48.360">
(MEN CHEERING)
</p>


From what I understand of the spec, what needs to happen is pretty easy,
because:

*Every <p> is mapped to a WebVTT cue.*


Which would result in:

00:00:46.320 --> 00:00:48.360
You guys wanna story?

00:00:46.320 --> 00:00:48.360
(MEN CHEERING)
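
A minimal Python sketch of that one-<p>-per-cue reading, for illustration
only. It assumes the simplified form above: no XML namespaces, begin/end
set directly on each <p>, and timestamps already in a form WebVTT accepts.
The function names and the wrapping <div> are hypothetical, not taken from
the spec.

```python
import xml.etree.ElementTree as ET


def ttml_paragraphs_to_cues(xml_text):
    """Return one (begin, end, text) tuple per <p>, in document order."""
    root = ET.fromstring(xml_text)
    cues = []
    for p in root.iter("p"):
        # Collapse the whitespace around and inside the paragraph text.
        text = " ".join("".join(p.itertext()).split())
        cues.append((p.get("begin"), p.get("end"), text))
    return cues


def format_webvtt(cues):
    """Serialize the cues as a WebVTT document, one cue block per tuple."""
    blocks = ["WEBVTT"]
    for begin, end, text in cues:
        blocks.append(f"{begin} --> {end}\n{text}")
    return "\n\n".join(blocks) + "\n"


# The simplified example, wrapped in a <div> so it parses as one document.
SAMPLE = """<div>
<p begin="00:00:46.320" end="00:00:48.360">
You guys wanna story?
</p>
<p begin="00:00:46.320" end="00:00:48.360">
(MEN CHEERING)
</p>
</div>"""

cues = ttml_paragraphs_to_cues(SAMPLE)
print(format_webvtt(cues))
```

Both cues come out with identical timing, in the order the <p> elements
appear in the document.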


Considering that these two cues have the exact same start and end time,
does their sequence carry meaning? I'm not sure, but I believe this result
is far from ideal, especially after applying the very last step listed in
the steps to convert TTML to WebVTT:

*The last step is to sort these cues from earliest to latest time, based on
each cue's beginning timestamp.*


As a result, either of the two cues may end up listed first, with the other
listed second. The order seems arbitrary.
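
The quoted step does not say how ties are broken. One possible
interpretation, which is an assumption here and not something read from
the spec, is a stable sort on the begin timestamp, so that cues with equal
begin times keep their TTML document order. A small Python sketch (the
helper names and the third cue are hypothetical):

```python
def timestamp_to_ms(ts):
    """Convert 'hh:mm:ss.ttt' to milliseconds (assumes a 3-digit fraction)."""
    hours, minutes, rest = ts.split(":")
    seconds, millis = rest.split(".")
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def sort_cues_stably(cues):
    """Sort by begin time only; Python's sorted() keeps ties in input order."""
    return sorted(cues, key=lambda cue: timestamp_to_ms(cue[0]))


# Cues in document order; the last one is a made-up earlier cue.
document_order = [
    ("00:00:46.320", "00:00:48.360", "You guys wanna story?"),
    ("00:00:46.320", "00:00:48.360", "(MEN CHEERING)"),
    ("00:00:05.500", "00:00:10.500", "Hypothetical earlier cue"),
]

sorted_cues = sort_cues_stably(document_order)
for begin, end, text in sorted_cues:
    print(f"{begin} --> {end}  {text}")
```

With a stable sort the two simultaneous cues always come out in the same
relative order. An unstable sort, or one that breaks ties on some other
key, would not guarantee that.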

*Okay, long email, you might think, but does this actually have any practical
implications? Yes!* If you play this back in a recent native HLS player on
an Apple device you get completely unusable results (not all cues are
presented, and the ones that are aren't necessarily presented at the right
time either).

So, I believe my questions are the following:


   - Is my understanding of the mapping spec as I presented it above
   correct?
   - If my understanding is correct, does my example simply represent an
   edge case that isn't properly covered by the spec, or is only Apple's
   WebVTT parser to blame here?
   - If this is an edge case not covered by the spec, what would be the way
   forward?


Happy to hear your input, and thank you for your thoughts.

For those interested, I created a simple Tears of Steel-based test stream
without audio:
https://origin.unified-streaming.com/public/tkt32756/main.m3u8

It contains the following WebVTT:

WEBVTT

00:00:05.000 --> 00:00:05.000
You guys wanna story?

00:00:05.000 --> 00:00:05.000
(MEN CHEERING)

00:00:05.500 --> 00:00:10.500
You guys wanna story?

00:00:05.500 --> 00:00:10.500
(MEN CHEERING)

00:00:11.500 --> 00:00:15.500
You guys wanna story?

00:00:11.500 --> 00:00:15.500
(MEN CHEERING)



Regards,
Boy

Received on Wednesday, 3 November 2021 23:15:27 UTC