[media-source] Coded Frame Processing is unspecified when incoming coded frame's presentation timestamp coincides with an existing frame's (#375) from Jean-Yves Avenard via GitHub on 2026-05-13 (public-html-media@w3.org from May 2026)

From: Jean-Yves Avenard via GitHub <noreply@w3.org>
Date: Wed, 13 May 2026 12:04:51 +0000
To: public-html-media@w3.org
Message-ID: <issues.opened-4437690490-1778673888-noreply@w3.org>
jyavenard has just created a new issue for https://github.com/w3c/media-source:

== Coded Frame Processing is unspecified when incoming coded frame's presentation timestamp coincides with an existing frame's ==
The [Coded Frame Processing](https://w3c.github.io/media-source/#sourcebuffer-coded-frame-processing) algorithm does not define what happens during mid-stream appending when the incoming coded frame's presentation timestamp coincides (exactly, or within the 1-microsecond tolerance the algorithm already recognizes elsewhere) with the presentation timestamp of a coded frame already in the track buffer.

The gap is reachable only when all three of the following hold at the time the incoming frame is processed:

1. `last decode timestamp` is set — so the "`last decode timestamp` unset" overlap check is gated out.
2. `highest end timestamp` is set and strictly greater than `presentation timestamp` — so neither branch of "Remove existing coded frames in track buffer" fires (the first branch requires `highest end timestamp` unset; the second requires `highest end timestamp ≤ presentation timestamp`).
3. An existing coded frame has a presentation timestamp equal to (or within 1 microsecond of) the incoming coded frame's.
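
The three conditions can be read as a single predicate. The following is a minimal sketch rather than spec text: `TrackState`, `reaches_gap`, and the inclusive `<=` comparison against the tolerance are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

TOLERANCE = 1e-6  # the 1-microsecond window, expressed in seconds


@dataclass(frozen=True)
class TrackState:
    """Hypothetical holder for the two per-track variables involved."""
    last_decode_timestamp: Optional[float]
    highest_end_timestamp: Optional[float]


def reaches_gap(state: TrackState, buffered_pts: Sequence[float], pts: float) -> bool:
    """Return True when all three triggering conditions hold for `pts`."""
    # 1. `last decode timestamp` is set, so the overlap check is gated out.
    cond1 = state.last_decode_timestamp is not None
    # 2. `highest end timestamp` is set and strictly greater than `pts`,
    #    so neither "Remove existing coded frames" branch fires.
    cond2 = (state.highest_end_timestamp is not None
             and state.highest_end_timestamp > pts)
    # 3. Some buffered frame sits at (or within 1 microsecond of) `pts`.
    #    Assumption: the window is treated as inclusive here.
    cond3 = any(abs(existing - pts) <= TOLERANCE for existing in buffered_pts)
    return cond1 and cond2 and cond3
```

With the worked example's state (`last decode timestamp` = 1.120, `highest end timestamp` = 1.200) and an incoming presentation timestamp of 1.040, the predicate is true; after a reset that unsets both variables it is false.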

Under those three conditions, none of steps 1.13, 1.14, or 1.15 targets the colliding existing frame. The "Add the coded frame" step then inserts the incoming frame alongside it, leaving two coded frames at the same presentation timestamp. The track buffer model and the "Add the coded frame" algorithm do not currently specify what happens when an incoming coded frame's presentation timestamp compares equal (including within the algorithm's 1-microsecond tolerance) to an existing coded frame's presentation timestamp.
Two implementations making opposite choices — one enforcing a single coded frame per presentation timestamp (by replacing the existing one, dropping the incoming one, or signalling an error) and one allowing coded frames with identical presentation timestamps to coexist — are both conformant under the current text, and produce different user-visible behaviour on the same content.
Downstream algorithms keyed on presentation timestamp (seeking, `buffered` reporting, duration change) return ambiguous results when two coded frames share the same presentation timestamp.

`abort()` and other paths that run the Reset Parser State algorithm unset `last decode timestamp` and `highest end timestamp` (spec §"Reset Parser State", bullets 2, 3, and 4), so any incoming frame after such a reset goes through the "`last decode timestamp` unset" overlap check, which resolves the collision correctly via the 1-microsecond window. Those paths do **not** reach this gap. Only continuous mid-stream appending, or other paths that leave `last decode timestamp` and `highest end timestamp` both set, can produce the collision.
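
To make the contrast with the reset path concrete, here is a hedged sketch of the video branch of the "`last decode timestamp` unset" overlap check. It follows one reading of the spec (remove the overlapped frame when the incoming presentation timestamp is less than the overlapped frame's presentation timestamp plus 1 microsecond); the function name and the `(pts, duration)` tuple representation are illustrative assumptions.

```python
from typing import List, Optional, Tuple

ONE_MICROSECOND = 1e-6  # seconds


def overlap_check(frames: List[Tuple[float, float]],
                  last_decode_timestamp: Optional[float],
                  pts: float) -> List[Tuple[float, float]]:
    """Return the (pts, duration) pairs left after the overlap check.

    Video branch only; audio splicing and timed text are out of scope.
    """
    if last_decode_timestamp is not None:
        # Gated out: this is the mid-stream case the issue is about.
        return list(frames)
    remaining = []
    for start, dur in frames:
        overlapped = start <= pts < start + dur
        # Assumed reading of the spec: the remove window is start + 1 microsecond.
        if overlapped and pts < start + ONE_MICROSECOND:
            continue  # the overlapped frame is removed
        remaining.append((start, dur))
    return remaining
```

Called with `last_decode_timestamp=None`, a frame at 1.040 is removed when the incoming presentation timestamp is 1.040; called with `last_decode_timestamp=1.120`, the same buffer comes back unchanged.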

Triggering shapes include:

- Fragmented MP4 whose edit-list or mid-stream composition-time-offset structure lands an incoming frame's effective presentation timestamp on an already-buffered frame's presentation timestamp, across a fragment boundary but within a single coded frame group (no DTS discontinuity).
- Open-GOP / bidirectional-prediction content where a new GOP's random-access point has a presentation timestamp coinciding with a still-buffered frame from the prior GOP.
- Content produced by segmenters that, for alignment or splicing reasons, emit overlapping presentation timestamps at fragment boundaries without resetting state.

This is a separate spec gap from #187 (which is scoped to "SAP Type 2" decode-shadowed orphans and does not cover presentation-timestamp coincidence at positions where the existing cleanup branches do not fire).

## Worked example

Timestamps are seconds-valued doubles (the representation the algorithm defines). Frame durations are 40 ms; no two presentation intervals overlap.

**Segment 1** — two adjacent GOPs already in the track buffer:

| Frame | Decode Timestamp | Presentation Timestamp | Duration | Type | Presentation Interval |
|:-----:|-----------------:|-----------------------:|---------:|:----:|:---------------------:|
| F1 | 1.000 | 1.080 | 0.040 | sync (I) | `[1.080, 1.120)` |
| F2 | 1.040 | 1.040 | 0.040 | non-sync | `[1.040, 1.080)` |
| F3 | 1.080 | 1.120 | 0.040 | sync (next-GOP I) | `[1.120, 1.160)` |
| F4 | 1.120 | 1.160 | 0.040 | non-sync | `[1.160, 1.200)` |

The MSE algorithm distinguishes only sync vs non-sync frames; the codec-side interpretation of why each frame's presentation timestamp stands in a given relationship to its decode timestamp is immaterial to the algorithm's behaviour. Any frame shape that satisfies the three triggering conditions (listed above) reaches this gap.

After Segment 1 (no reset has run since): `last decode timestamp` = 1.120, `last frame duration` = 0.040, `highest end timestamp` = 1.200, `need random access point flag` = false (cleared when F1 was processed).

**Segment 2** — the incoming coded frame, with presentation timestamp colliding with F2's:

| Frame | Decode Timestamp | Presentation Timestamp | Duration | Type |
|:-----:|-----------------:|-----------------------:|---------:|:----:|
| incoming | 1.160 | 1.040 | 0.040 | sync |

The incoming frame's decode timestamp (1.160) > `last decode timestamp` (1.120), and `1.160 − 1.120 = 0.040` is **not** greater than `2 × last frame duration = 0.080` — so the DTS-discontinuity step does **not** fire. No state variables are reset. The incoming frame's presentation timestamp (1.040) equals F2's.
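
The arithmetic above can be checked with a small sketch of the discontinuity test. The function name and signature are illustrative, and the "decode timestamp earlier than `last decode timestamp`" clause comes from a reading of the spec rather than from the text above.

```python
from typing import Optional


def dts_discontinuity(decode_ts: float,
                      last_decode_ts: Optional[float],
                      last_frame_duration: float) -> bool:
    """Return True when the DTS-discontinuity step would fire."""
    if last_decode_ts is None:
        return False  # nothing to compare against
    if decode_ts < last_decode_ts:
        return True   # decode time moved backwards (assumed spec clause)
    # Forward jump larger than twice the last frame duration.
    return (decode_ts - last_decode_ts) > 2 * last_frame_duration


# Values from the worked example: 1.160 - 1.120 is not greater than 0.080.
fires = dts_discontinuity(1.160, 1.120, 0.040)
```

For the worked example `fires` is `False`, so `last decode timestamp` and `highest end timestamp` stay set when the incoming frame is processed.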

### Current spec trace

1. `need random access point flag` is false (cleared when F1 was processed earlier in Segment 1) → no gate.
2. `last decode timestamp` is set → skip overlap check.
3. **Remove existing coded frames**:
    - First branch requires `highest end timestamp` unset — it is set (1.200). Skip.
    - Second branch requires `highest end timestamp ≤ presentation timestamp`, i.e., `1.200 ≤ 1.040`. False. Skip.
4. **Remove decoding dependencies**: no frames removed above → no-op.
5. **Add the coded frame**: the incoming frame is added.

Resulting buffer (presentation order):

```
F2       (pres=1.040, dur=0.040, non-sync)  ┐
incoming (pres=1.040, dur=0.040, sync)      ┘ both at presentation timestamp 1.040
F1       (pres=1.080, dur=0.040, sync)
F3       (pres=1.120, dur=0.040, sync)
F4       (pres=1.160, dur=0.040, non-sync)
```

F2 and the incoming frame share presentation timestamp 1.040. Algorithms that refer to "the coded frame in track buffer with a presentation interval that contains *t*" (used by the overlap step and by several sibling algorithms) presuppose exactly one such coded frame, but two now exist at presentation timestamp 1.040. Conformant implementations resolve this differently, producing different user-visible behaviour on the same content.
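
The trace can be reproduced with a short simulation. This is a sketch under stated assumptions: `Frame` and `append_mid_stream` are hypothetical names, the removal ranges inside the two branches follow one reading of the spec, and the dependency-removal step is left out because nothing is removed in this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Frame:
    name: str
    dts: float
    pts: float
    dur: float
    sync: bool


def append_mid_stream(buffer: List[Frame], highest_end: Optional[float],
                      frame: Frame) -> List[Frame]:
    """Trace steps 3 to 5 for a frame that arrives while
    `last decode timestamp` is set (so the overlap check is skipped)."""
    frame_end = frame.pts + frame.dur
    if highest_end is None:
        window_start = frame.pts      # first branch
    elif highest_end <= frame.pts:
        window_start = highest_end    # second branch
    else:
        window_start = None           # neither branch fires
    if window_start is None:
        kept = list(buffer)
    else:
        kept = [f for f in buffer if not (window_start <= f.pts < frame_end)]
    # Dependency removal omitted: it is a no-op when nothing was removed.
    # sorted() is stable, so an existing frame stays ahead of the incoming
    # one when both have the same presentation timestamp.
    return sorted(kept + [frame], key=lambda f: f.pts)


segment1 = [
    Frame("F1", 1.000, 1.080, 0.040, True),
    Frame("F2", 1.040, 1.040, 0.040, False),
    Frame("F3", 1.080, 1.120, 0.040, True),
    Frame("F4", 1.120, 1.160, 0.040, False),
]
incoming = Frame("incoming", 1.160, 1.040, 0.040, True)

result = append_mid_stream(segment1, highest_end=1.200, frame=incoming)
at_1040 = [f.name for f in result if f.pts == 1.040]
```

Here `at_1040` is `['F2', 'incoming']`: two coded frames at one presentation timestamp, matching the buffer listing above.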

### Amended spec trace

The new step runs before "Remove existing coded frames":

> Remove all coded frames whose presentation timestamp is within 1 microsecond of 1.040.

Applied to each existing frame:

| Frame | Presentation Timestamp | within 1µs of 1.040? | Action |
|:-----:|-----------------------:|:--------------------:|:------:|
| F1 | 1.080 | no | preserved |
| F2 | 1.040 | yes | **removed** |
| F3 | 1.120 | no | preserved |
| F4 | 1.160 | no | preserved |

Dependency sweep: next random access point after F2 in decode order is F3. Decode-order range strictly between `{F2}` and F3 is empty; no new removals.

Resulting buffer (presentation order):

```
incoming (pres=1.040, dur=0.040, sync)       ← single frame at presentation timestamp 1.040
F1       (pres=1.080, dur=0.040, sync)
F3       (pres=1.120, dur=0.040, sync)
F4       (pres=1.160, dur=0.040, non-sync)
```

Exactly one coded frame at presentation timestamp 1.040 — the incoming frame. Downstream algorithms that refer to "the coded frame … containing *t = 1.040*" have an unambiguous answer.
Implementations converge on identical buffer state.
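
The amended trace can be simulated the same way. This is a sketch, not a reference implementation: `Frame`, `remove_colliding`, `sweep_dependencies`, and `append_amended` are hypothetical names, the tolerance is treated as inclusive, and the "Remove existing coded frames" branches are omitted because neither fires in this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

TOLERANCE = 1e-6  # the 1-microsecond window, in seconds


@dataclass(frozen=True)
class Frame:
    name: str
    dts: float
    pts: float
    dur: float
    sync: bool


def remove_colliding(buffer: List[Frame], pts: float) -> Tuple[List[Frame], List[Frame]]:
    """New step: split the buffer into (kept, removed) by PTS collision."""
    removed = [f for f in buffer if abs(f.pts - pts) <= TOLERANCE]
    kept = [f for f in buffer if f not in removed]
    return kept, removed


def sweep_dependencies(kept: List[Frame], removed: List[Frame]) -> List[Frame]:
    """Drop frames that follow a removed frame in decode order, up to
    (not including) the next random access point."""
    by_dts = sorted(kept, key=lambda f: f.dts)
    doomed = set()
    for r in removed:
        for f in by_dts:
            if f.dts <= r.dts:
                continue
            if f.sync:
                break  # next random access point reached
            doomed.add(f.name)
    return [f for f in kept if f.name not in doomed]


def append_amended(buffer: List[Frame], frame: Frame) -> List[Frame]:
    kept, removed = remove_colliding(buffer, frame.pts)
    kept = sweep_dependencies(kept, removed)
    return sorted(kept + [frame], key=lambda f: f.pts)


segment1 = [
    Frame("F1", 1.000, 1.080, 0.040, True),
    Frame("F2", 1.040, 1.040, 0.040, False),
    Frame("F3", 1.080, 1.120, 0.040, True),
    Frame("F4", 1.120, 1.160, 0.040, False),
]
incoming = Frame("incoming", 1.160, 1.040, 0.040, True)
result = append_amended(segment1, incoming)
```

`result` holds `incoming`, F1, F3, F4 in presentation order, with a single frame at 1.040. If the incoming frame instead collided with F3 (presentation timestamp 1.120), the sweep in this sketch would also drop F4, the non-sync frame that follows F3 in decode order.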

## Proposed amendment

Insert a new step into the per-coded-frame loop, immediately after the existing "If `last decode timestamp` for `track buffer` is unset and `presentation timestamp` falls within the presentation interval of a coded frame in track buffer" step, and before the "Remove existing coded frames in track buffer" step:

```html
<li>Remove all coded frames from |track buffer| whose [=presentation timestamp=] is within 1 microsecond of |presentation timestamp|.
  <p class="note">
    This uses the same 1-microsecond tolerance as the `last decode timestamp unset` overlap step earlier in this algorithm, and for the same reason: to compensate for minor errors in frame timestamp computations that can appear when converting back and forth between double precision floating point numbers and rationals. After this step, the track buffer cannot hold two coded frames sharing the same [=presentation timestamp=], so the subsequent "Add the coded frame … to the track buffer" step unambiguously makes the incoming coded frame the one at that [=presentation timestamp=].
  </p>
</li>
```

Update the immediately-following "Remove all possible decoding dependencies …" step to extend its sweep to include frames removed by this new step:

```diff
-  Remove all possible decoding dependencies on the coded frames removed in the previous two steps by removing all coded frames from |track buffer| between those frames removed in the previous two steps and the next random access point after those removed frames.
+  Remove all possible decoding dependencies on the coded frames removed in the previous three steps by removing all coded frames from |track buffer| between those frames removed in the previous three steps and the next random access point after those removed frames.
```

### Interaction with #187

If the amendment proposed in #187 lands first, merge the predicate from this issue into the step added there (combining both clauses into a single `<li>` list of conditions) and keep the dependency-cleanup step's "previous three steps" wording. If this issue lands first, the #187 amendment should do the same in reverse.

### Scope and side effects

The step is a no-op when no existing coded frame is within 1 microsecond of the incoming presentation timestamp, which is the common case in continuous appending.

As with the other removal steps, step 1.15's conservative dependency sweep may remove additional non-sync coded frames that lie between the removed frame and the next random access point in decode order. This is a pre-existing property of step 1.15, not introduced by this amendment.

### What this does not cover

Coded frames with the same decode timestamp but different presentation timestamps are not addressed by this amendment. The spec's current storage model tolerates them, and they do not produce cross-implementation divergence of the same kind as the presentation-timestamp collision this issue describes.


Please view or discuss this issue at https://github.com/w3c/media-source/issues/375 using your GitHub account


Received on Wednesday, 13 May 2026 12:04:55 UTC