Re: [TTML-WEBVTT] How to map multiple <p> that share same time range to WebVTT from Boy van Dijk on 2021-11-12 (public-tt@w3.org from November 2021)

From: Boy van Dijk <boy@unified-streaming.com>
Date: Fri, 12 Nov 2021 04:43:01 -0800
To: Pierre-Anthony Lemieux <pal@sandflow.com>
Cc: TTWG <public-tt@w3.org>
Message-ID: <CADxNcnLmxkO7LgvRnXuPeW=3hqhmBJ15VhNgBKp705r9mgW2Lg@mail.gmail.com>
Hi Pierre,

Thanks for your response, and sorry for the slow reply. I had expected
that others might weigh in on this as well.

Unfortunately, I believe my initial message might not have been entirely
clear because the original formatting was removed. For example:

"Every <p> is mapped to a WebVTT cue."

is not something I wrote myself, as it might have seemed, but a direct
quote from the TTML WebVTT mapping spec (
https://w3c.github.io/ttml-webvtt-mapping/).

You say this quote might not represent the right strategy:

- Does that mean you're of the opinion that the specification should be
changed in some way?
- Or are you saying I'm somehow missing something relevant in the
specification (which might very well be the case!)?

As for the second option, and the strategy you propose, using ISDs: there
seems to be no reference to ISDs in the document, so I'm not sure how a
person reading this specification should know that a conversion to ISDs
needs to take place first, or that ISDs play some other role in the
conversion process. That said, working with ISDs might very well be the
better approach, as you indicate.
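
For illustration, here is a minimal Python sketch of what an ISD-style
conversion might look like for the simplified example. It is only an
approximation under stated assumptions: the (begin, end, text) tuples and
the function names are hypothetical, regions and styles are ignored, and it
is neither the ISD construction defined by TTML nor the ttconv
implementation. It only shows the idea of emitting one cue per time
interval, so that no two cues overlap.

```python
from typing import List, Tuple

# Hypothetical simplified model of a <p>: (begin_seconds, end_seconds, text).
Paragraph = Tuple[float, float, str]


def fmt(t: float) -> str:
    """Format seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    ms = round(t * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def paragraphs_to_cues(paragraphs: List[Paragraph]) -> List[Paragraph]:
    """Return non-overlapping cues, one per interval with active content."""
    # Every begin/end time is a point where the set of active <p> may change.
    bounds = sorted({t for b, e, _ in paragraphs for t in (b, e)})
    cues: List[Paragraph] = []
    for start, end in zip(bounds, bounds[1:]):
        # Paragraphs active during [start, end), kept in document order.
        active = [text for b, e, text in paragraphs if b <= start and end <= e]
        if not active:
            continue
        payload = "\n".join(active)
        # Merge with the previous cue when contiguous and identical.
        if cues and cues[-1][1] == start and cues[-1][2] == payload:
            cues[-1] = (cues[-1][0], end, payload)
        else:
            cues.append((start, end, payload))
    return cues


def to_webvtt(cues: List[Paragraph]) -> str:
    """Serialize cues as a WebVTT document."""
    blocks = ["WEBVTT"]
    for b, e, text in cues:
        blocks.append(f"{fmt(b)} --> {fmt(e)}\n{text}")
    return "\n\n".join(blocks) + "\n"


example = [
    (46.32, 48.36, "You guys wanna story?"),
    (46.32, 48.36, "(MEN CHEERING)"),
]
print(to_webvtt(paragraphs_to_cues(example)))
```

With this approach the two <p> elements from the example end up as two
lines in a single cue, in document order, so the final sort by begin time
has nothing left to reorder.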

Please let me know your thoughts.

Regards,
Boy

On 4 Nov 2021 at 01:23:48, Pierre-Anthony Lemieux <pal@sandflow.com> wrote:

> Hi Boy,
>
> Every <p> is mapped to a WebVTT cue.
>
>
> I am not convinced that this is the right strategy.
>
> I would instead map each intermediate synchronic document (ISD) to a
> WebVTT cue, so that no two cues have overlapping temporal ranges.
>
> This is the approach taken by https://github.com/sandflow/ttconv.
>
> Best,
>
> -- Pierre
>
>
>
> On Wed, Nov 3, 2021 at 4:16 PM Boy van Dijk <boy@unified-streaming.com>
> wrote:
>
>
> Hi,
>
>
> I represent Unified Streaming and I'm seeking your expertise about mapping
> to WebVTT the following bit of TTML:
>
>
> <div begin="00:00:46.320" end="00:00:48.360">
>
>     <p style="singleHeightStyle" tts:textAlign="center" region="region-20">
>
>         <span
> xml:space="preserve">&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;</span>
>
>         <span xml:space="preserve"
> tts:backgroundColor="black">&#xA0;You&#xA0;guys&#xA0;wanna&#xA0;story?&#xA0;</span>
>
>         <span
> xml:space="preserve">&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;</span>
>
>     </p>
>
>     <p style="singleHeightStyle" tts:display="inlineBlock"
> xml:space="preserve" region="region-20">&#xA0;</p>
>
>     <p style="singleHeightStyle" tts:textAlign="center" region="region-20">
>
>         <span
> xml:space="preserve">&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;</span>
>
>         <span xml:space="preserve"
> tts:backgroundColor="black">&#xA0;(MEN&#xA0;CHEERING)&#xA0;</span>
>
>         <span
> xml:space="preserve">&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;&#xA0;</span>
>
>     </p>
>
> </div>
>
>
>
> Or, my problem in simplified form:
>
>
> <p begin="00:00:46.320" end="00:00:48.360">
>
> You guys wanna story?
>
> </p>
>
> <p begin="00:00:46.320" end="00:00:48.360">
>
> (MEN CHEERING)
>
> </p>
>
>
>
> From what I understand of the spec, what needs to happen is pretty easy,
> because:
>
>
> "Every <p> is mapped to a WebVTT cue."
>
>
>
> Which would result in:
>
>
> 00:00:46.320 --> 00:00:48.360
>
> You guys wanna story?
>
>
> 00:00:46.320 --> 00:00:48.360
>
> (MEN CHEERING)
>
>
>
> Considering that these two cues have the exact same start and end time,
> does their sequence carry meaning? I'm not sure, but I believe this result
> is far from ideal, especially after applying the very last step listed in
> the steps to convert TTML to WebVTT:
>
>
> "The last step is to sort these cues from earliest to latest time, based on
> each cue's beginning timestamp."
>
>
>
> As a result, either of the two cues may end up listed first, with the
> other listed second. The order seems to be arbitrary.
>
>
> Okay, long email, you might think, but does this actually have any practical
> implications? Yes! If you play this back in a recent native HLS player on
> an Apple device you get completely unusable results (not all cues are
> presented, and the ones that are aren't necessarily presented at the right
> time either).
>
>
> So, I believe my questions are the following:
>
>
> Is my understanding of the mapping spec as I presented it above correct?
>
> If my understanding is correct, does my example simply represent an edge
> case that isn't properly covered by the spec, or is only Apple's WebVTT
> parser to blame here?
>
> If this is an edge case not covered by the spec, what would be the way
> forward?
>
>
>
> Happy to hear your input, and thank you for your thoughts.
>
>
> For those interested, I created a simple Tears of Steel-based test stream
> without audio:
> https://origin.unified-streaming.com/public/tkt32756/main.m3u8
>
>
> It contains the following WebVTT:
>
>
> WEBVTT
>
>
> 00:00:05.000 --> 00:00:05.000
>
> You guys wanna story?
>
>
> 00:00:05.000 --> 00:00:05.000
>
> (MEN CHEERING)
>
>
> 00:00:05.500 --> 00:00:10.500
>
> You guys wanna story?
>
>
> 00:00:05.500 --> 00:00:10.500
>
> (MEN CHEERING)
>
>
> 00:00:11.500 --> 00:00:15.500
>
> You guys wanna story?
>
>
> 00:00:11.500 --> 00:00:15.500
>
> (MEN CHEERING)
>
>
>
> Regards,
>
> Boy
>
>
Received on Friday, 12 November 2021 12:44:16 UTC