[media-source] Accidental trimming of overlapping text cues (#363)

ntrrgc has just created a new issue for https://github.com/w3c/media-source:

== Accidental trimming of overlapping text cues ==
Let's consider the following WebVTT stream:

```webvtt
WEBVTT

00:00:00.000 --> 00:00:10.000 position:10%
Cue A

00:00:01.000 --> 00:00:02.000 position:80%
Cue B

00:00:05.000 --> 00:00:06.000 position:80%
Cue C
```

WebVTT cues can overlap, as they do in the above example. Graphically, this is what the above example looks like:

<img alt="diagram of the cues in a timeline" src="https://github.com/user-attachments/assets/a26ff24f-f735-4a2a-b092-b365633066f5" height="160">

Notably, as far as I've been able to find, the MSE spec has no concept of cues, only coded frames. Also, as far as I've been able to find, **the relation between cues and coded frames is not specified.**

For the remainder of this explanation, **let's assume a 1 to 1 mapping between cues and coded frames**: each cue would be encoded and represented as one coded frame. As a consequence, the presentation intervals of the coded frames can overlap. **This is how WebVTT cues are represented in samples in WebM** (in both `S_TEXT/WEBVTT` and `D_WEBVTT/kind` formats).
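
To make the assumed 1 to 1 mapping concrete, the following Python sketch turns each cue of the example stream into one coded frame. The `CodedFrame` class and the `cues_to_frames` helper are hypothetical names used only for illustration. The parsing handles just enough of the cue timing line for this example, and nothing here is taken from the MSE or WebM specifications.

```python
import re
from dataclasses import dataclass


@dataclass
class CodedFrame:
    payload: str
    pts: float       # presentation timestamp, in seconds
    duration: float  # presentation duration, in seconds


# Hours are optional in this simplified pattern; cue settings after the
# end timestamp (e.g. "position:10%") are ignored.
TIMESTAMP = r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
TIMING_LINE = re.compile(rf"{TIMESTAMP} --> {TIMESTAMP}")


def to_seconds(hours, minutes, seconds, millis):
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def cues_to_frames(vtt_text):
    """Map each cue to exactly one coded frame (assumed 1 to 1 mapping)."""
    frames = []
    lines = vtt_text.splitlines()
    for index, line in enumerate(lines):
        match = TIMING_LINE.match(line)
        if match is None:
            continue
        groups = match.groups()
        start, end = to_seconds(*groups[:4]), to_seconds(*groups[4:])
        # Simplification: treat only the line after the timing line as payload.
        payload = lines[index + 1] if index + 1 < len(lines) else ""
        frames.append(CodedFrame(payload, start, end - start))
    return frames


EXAMPLE = """WEBVTT

00:00:00.000 --> 00:00:10.000 position:10%
Cue A

00:00:01.000 --> 00:00:02.000 position:80%
Cue B

00:00:05.000 --> 00:00:06.000 position:80%
Cue C
"""

for frame in cues_to_frames(EXAMPLE):
    print(frame.payload, frame.pts, frame.pts + frame.duration)
```

The three resulting frames cover [0, 10), [1, 2) and [5, 6) seconds, so their presentation intervals overlap just as the cues do.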

> [!NOTE]
> This is not the only possible mapping that could be defined between cues and coded frames, nor the least problematic one.
>
> For comparison, when encoded in MP4/ISO BMFF, overlapping WebVTT cues are split into multiple non-overlapping samples. A different mapping could be 1 coded frame representing 1 ISO BMFF WebVTT sample, and this is what the macOS port of WebKit uses.
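
For illustration, here is a rough sketch of how overlapping cues could be cut into non-overlapping samples, in the spirit of the MP4 approach mentioned in the note above. The function name and the tuple layout are assumptions made for this example, and the sketch does not attempt to reproduce the actual ISO BMFF sample format.

```python
def split_into_samples(cues):
    """Split overlapping cues into consecutive, non-overlapping samples.

    cues: list of (label, start, end) tuples, times in seconds.
    Returns a list of (start, end, [labels of cues active in that interval]).
    """
    # Every cue start or end is a potential sample boundary.
    boundaries = sorted({t for _, start, end in cues for t in (start, end)})
    samples = []
    for start, end in zip(boundaries, boundaries[1:]):
        active = [label for label, s, e in cues if s < end and e > start]
        samples.append((start, end, active))
    return samples


cues = [("Cue A", 0, 10), ("Cue B", 1, 2), ("Cue C", 5, 6)]
for sample in split_into_samples(cues):
    print(sample)
```

For the example stream this yields five samples: [0, 1) with Cue A, [1, 2) with Cues A and B, [2, 5) with Cue A, [5, 6) with Cues A and C, and [6, 10) with Cue A. Because no two samples overlap, a frame-per-sample mapping would not run into the overlap problem described in this issue.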

With the current *Coded Frame Processing* algorithm, this is what would happen for the WebVTT stream above:

* Initially:
    * last decode timestamp = unset
    * last frame duration = unset
    * highest end timestamp = unset
    * Samples: empty
* A coded frame representing Cue A with presentation interval [0, 10)s is processed:
    * last decode timestamp = 0
    * last frame duration = 10
    * highest end timestamp = 10
    * Samples:
        * Cue A: [0, 10)s
* A coded frame representing Cue B with presentation interval [1, 2)s is processed:
    * last decode timestamp = 1
    * last frame duration = 1
    * highest end timestamp = 10
    * Samples:
        * Cue A: [0, 10)s
        * Cue B: [1, 2)s
* A coded frame representing Cue C with presentation interval [5, 6)s is processed:
    * A new coded frame group is started because, quoting step 1.1.6:
    
        > last decode timestamp for track buffer is set and the difference between decode timestamp and last decode timestamp is greater than 2 times last frame duration

        As such, last decode timestamp, last frame duration and highest end timestamp are all unset.
    * **Splicing of Cue A occurs** because, quoting step 1.1.13:
        
        > last decode timestamp for track buffer is unset and presentation timestamp falls within the presentation interval of a coded frame in track buffer

        This changes the duration of Cue A from 10 seconds to 5 seconds.
    * last decode timestamp = 5
    * last frame duration = 1
    * highest end timestamp = 6
    * Samples:
        * Cue A: [0, 5)s
        * Cue B: [1, 2)s
        * Cue C: [5, 6)s
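
The walkthrough above can be reproduced with a small simulation. This is a minimal sketch that models only the two quoted checks plus the three tracked variables. It leaves out the rest of the *Coded Frame Processing* algorithm (append windows, timestamp offsets, frame removal, and so on). `TrackBuffer` and `CodedFrame` are hypothetical names, and the sketch assumes that a text frame's decode timestamp equals its presentation timestamp.

```python
from dataclasses import dataclass


@dataclass
class CodedFrame:
    label: str
    pts: float       # presentation timestamp, in seconds
    duration: float  # presentation duration, in seconds

    @property
    def end(self):
        return self.pts + self.duration


class TrackBuffer:
    """Hypothetical, heavily simplified model of an MSE track buffer."""

    def __init__(self):
        self.last_decode_timestamp = None
        self.last_frame_duration = None
        self.highest_end_timestamp = None
        self.frames = []

    def process(self, label, pts, duration):
        dts = pts  # assumption: text frames decode in presentation order

        # Discontinuity check (the step quoted as 1.1.6 above).
        if self.last_decode_timestamp is not None and (
            dts < self.last_decode_timestamp
            or dts - self.last_decode_timestamp > 2 * self.last_frame_duration
        ):
            self.last_decode_timestamp = None
            self.last_frame_duration = None
            self.highest_end_timestamp = None

        # Splice check (the step quoted as 1.1.13 above).
        if self.last_decode_timestamp is None:
            for frame in self.frames:
                if frame.pts <= pts < frame.end:
                    frame.duration = pts - frame.pts  # trim the overlapped frame
                    break  # assumption: only the first overlapped frame is trimmed

        frame = CodedFrame(label, pts, duration)
        self.frames.append(frame)
        self.last_decode_timestamp = dts
        self.last_frame_duration = duration
        if self.highest_end_timestamp is None or frame.end > self.highest_end_timestamp:
            self.highest_end_timestamp = frame.end


buffer = TrackBuffer()
for label, start, end in [("Cue A", 0, 10), ("Cue B", 1, 2), ("Cue C", 5, 6)]:
    buffer.process(label, start, end - start)

for frame in buffer.frames:
    print(f"{frame.label}: [{frame.pts}, {frame.end})s")
```

Under these assumptions the script prints Cue A as [0, 5)s, Cue B as [1, 2)s and Cue C as [5, 6)s, matching the final state in the list above.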

Graphically, this is the result:

<img alt="diagram of the cues in time where Cue A has been unexpectedly trimmed short" src="https://github.com/user-attachments/assets/942857f9-6153-4d6e-9219-299cf72b40a8" height="160">

A potential fix could be amending the check in step 1.1.6 so that it doesn't trigger when the presentation timestamp is less than or equal to the highest end timestamp.
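
As a sketch of what that amendment might look like, the predicate below models the discontinuity check with and without the extra guard. The function name and parameters are hypothetical. Applying the guard only to the "greater than 2 times last frame duration" condition, and not to the "decode timestamp is less than last decode timestamp" condition, is an assumption of this example rather than something the issue specifies.

```python
def starts_new_group(dts, pts, last_dts, last_duration, highest_end, amended=False):
    """Return True if the frame would start a new coded frame group."""
    if last_dts is None:
        return False  # nothing to compare against yet
    if dts < last_dts:
        return True  # decode timestamps went backwards
    if amended and highest_end is not None and pts <= highest_end:
        # Proposed guard: the frame still falls inside already-buffered
        # presentation time, so the gap is not treated as a discontinuity.
        return False
    return dts - last_dts > 2 * last_duration


# State after Cue B has been processed, with Cue C ([5, 6)s) arriving next.
state = dict(last_dts=1, last_duration=1, highest_end=10)
print(starts_new_group(5, 5, **state))                # True: Cue A gets spliced
print(starts_new_group(5, 5, amended=True, **state))  # False: Cue A is left intact
```

With the guard in place, last decode timestamp stays set when Cue C arrives, so the splicing condition quoted from step 1.1.13 is not met and Cue A keeps its [0, 10)s interval. A frame that starts beyond the highest end timestamp, for example at 20 s, would still start a new coded frame group.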

Please view or discuss this issue at https://github.com/w3c/media-source/issues/363 using your GitHub account



Received on Monday, 10 March 2025 17:19:52 UTC