- From: Boy van Dijk <boy@unified-streaming.com>
- Date: Wed, 3 Nov 2021 14:37:26 -0700
- To: public-tt@w3.org
- Message-ID: <CADxNcn+RFgnJ=gpesGtvp=N9_fGy3JFb+5iZ5XW-sv8DsEvNxA@mail.gmail.com>
Hi,

I represent Unified Streaming and I'm seeking your expertise on mapping the following bit of TTML to WebVTT:

    <div begin="00:00:46.320" end="00:00:48.360">
      <p style="singleHeightStyle" tts:textAlign="center" region="region-20">
        <span xml:space="preserve">       </span>
        <span xml:space="preserve" tts:backgroundColor="black"> You guys wanna story? </span>
        <span xml:space="preserve">       </span>
      </p>
      <p style="singleHeightStyle" tts:display="inlineBlock" xml:space="preserve" region="region-20"> </p>
      <p style="singleHeightStyle" tts:textAlign="center" region="region-20">
        <span xml:space="preserve">          </span>
        <span xml:space="preserve" tts:backgroundColor="black"> (MEN CHEERING) </span>
        <span xml:space="preserve">           </span>
      </p>
    </div>

Or, my problem in simplified form:

    <p begin="00:00:46.320" end="00:00:48.360"> You guys wanna story? </p>
    <p begin="00:00:46.320" end="00:00:48.360"> (MEN CHEERING) </p>

From what I understand of the spec, what needs to happen is pretty easy, because:

*Every <p> is mapped to a WebVTT cue.*

Which would result in:

    00:00:46.320 --> 00:00:48.360
    You guys wanna story?

    00:00:46.320 --> 00:00:48.360
    (MEN CHEERING)

Considering that these two cues have the exact same start and end time, does their sequence carry meaning? I'm not sure, but I believe this result is far from ideal, especially after applying the very last step listed in the steps to convert TTML to WebVTT:

*The last step is to sort these cues from earliest to latest time, based on each cue's beginning timestamp.*

As a result, either cue may end up listed first, with the other second: because their start times are identical, the sort does not determine their relative order. Randomness, it seems.

*Okay, long email you might think, but does this actually have any practical implications? Yes!*

If you play this back in a recent native HLS player on an Apple device, you get completely unusable results (not all cues are presented, and the ones that are aren't necessarily presented at the right time either).
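
To make the ordering concern concrete, here is a minimal Python sketch of the final sort step. It is an illustration only, not something the mapping spec defines: the `Cue` type, `to_ms` and `sort_cues` are hypothetical names, and the first cue in the list is made-up filler. The sketch assumes that a stable sort keyed on the start time alone is an acceptable reading of "sort these cues"; whether the spec intends that is the open question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cue:
    # Hypothetical cue record; timestamps use the "HH:MM:SS.mmm" form.
    start: str
    end: str
    text: str


def to_ms(timestamp: str) -> int:
    """Convert "HH:MM:SS.mmm" to integer milliseconds."""
    hours, minutes, rest = timestamp.split(":")
    seconds, millis = rest.split(".")
    total_seconds = (int(hours) * 60 + int(minutes)) * 60 + int(seconds)
    return total_seconds * 1000 + int(millis)


def sort_cues(cues):
    # Python's sorted() is stable, so cues with identical start times
    # keep the order in which they appeared in the source document.
    return sorted(cues, key=lambda cue: to_ms(cue.start))


# Cues in TTML document order; the first entry is hypothetical filler.
cues = [
    Cue("00:00:50.000", "00:00:52.000", "(hypothetical later cue)"),
    Cue("00:00:46.320", "00:00:48.360", "You guys wanna story?"),
    Cue("00:00:46.320", "00:00:48.360", "(MEN CHEERING)"),
]

for cue in sort_cues(cues):
    print(f"{cue.start} --> {cue.end}\n{cue.text}\n")
```

With a stable sort, the two simultaneous cues stay in document order. An unstable sort, or one with an additional tie-breaking key, could emit them in either order, which is the ambiguity described above.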

So, I believe my questions are the following:

- Is my understanding of the mapping spec as I presented it above correct?
- If my understanding is correct, does my example simply represent an edge case that isn't properly covered by the spec, or is only Apple's WebVTT parser to blame here?
- If this is an edge case not covered by the spec, what would be the way forward?

Happy to hear your input, and thank you for your thoughts.

For those interested, I created a simple Tears of Steel-based test stream without audio:

https://origin.unified-streaming.com/public/tkt32756/main.m3u8

It contains the following WebVTT:

    WEBVTT

    00:00:05.000 --> 00:00:05.000
    You guys wanna story?

    00:00:05.000 --> 00:00:05.000
    (MEN CHEERING)

    00:00:05.500 --> 00:00:10.500
    You guys wanna story?

    00:00:05.500 --> 00:00:10.500
    (MEN CHEERING)

    00:00:11.500 --> 00:00:15.500
    You guys wanna story?

    00:00:11.500 --> 00:00:15.500
    (MEN CHEERING)

Regards,
Boy

Received on Wednesday, 3 November 2021 23:15:27 UTC