[media-source] Coded Frame Processing does not remove "SAP Type 2" decode-shadowed orphans (closes #187) (#374) from Jean-Yves Avenard via GitHub on 2026-05-13 (public-html-media@w3.org from May 2026)

From: Jean-Yves Avenard via GitHub <noreply@w3.org>
Date: Wed, 13 May 2026 11:53:58 +0000
To: public-html-media@w3.org
Message-ID: <issues.opened-4437613738-1778673235-noreply@w3.org>
jyavenard has just created a new issue for https://github.com/w3c/media-source:

== Coded Frame Processing does not remove "SAP Type 2" decode-shadowed orphans (closes #187) ==
The [Coded Frame Processing](https://w3c.github.io/media-source/#sourcebuffer-coded-frame-processing) algorithm's cleanup steps leave a gap for the "SAP Type 2" / out-of-order decode case originally raised in #187.

When the incoming coded frame is a [=random access point=] and the track buffer contains coded frames whose decode timestamp is greater than or equal to the incoming coded frame's decode timestamp and whose presentation timestamp is less than the incoming coded frame's presentation timestamp (the classic SAP Type 2 / open-GOP shape), those frames survive every cleanup step in the current algorithm:

- The "`last decode timestamp` unset" overlap check targets only the single coded frame whose presentation interval contains `presentation timestamp`, and is further gated by `last decode timestamp` being unset.
- The "Remove existing coded frames in track buffer" step's two branches remove frames with presentation timestamp greater than or equal to `presentation timestamp` (when `highest end timestamp` is not set) or greater than or equal to `highest end timestamp` (when it is set and less than or equal to `presentation timestamp`). Decode-shadowed orphans have presentation timestamps _below_ the incoming coded frame's, not above, so neither branch targets them.
- The "Remove all possible decoding dependencies" step walks forward from the set of frames removed in the previous two steps; it does not walk backward toward the incoming coded frame's decode timestamp.

After "Add the coded frame with the presentation timestamp, decode timestamp, and frame duration to the track buffer", the track buffer contains coded frames that will decode after the incoming random access point but present before it. No random access point remains at or before their presentation timestamps once the incoming one takes over. They become unreachable on seek and can produce decoder state mismatches when played back.

Any content that is not all-sync and has out-of-order presentation timestamps can reach this gap: it is triggered by an append after a discontinuity whose new random access point lands inside an existing GOP's presentation range.

The analysis and example below are scoped to the post-discontinuity case, in which the DTS-discontinuity step has fired and `highest end timestamp` is unset. The third implicit case of "Remove existing coded frames" (where `highest end timestamp` is set and strictly greater than `presentation timestamp`) can exhibit related symptoms; this issue does not cover that case.

## Worked example

Timestamps are seconds-valued doubles (the representation the algorithm defines). Frame durations in Segment 1 are 40 ms, matching a 25 fps cadence, and no two of its presentation intervals overlap; the incoming frame in Segment 2 has a 20 ms duration.

**Segment 1** — a GOP with out-of-order decode already in the track buffer:

| Frame | Decode Timestamp | Presentation Timestamp | Duration | Type | Presentation Interval |
|:-----:|-----------------:|-----------------------:|---------:|:----:|:---------------------:|
| F1 | 1.000 | 1.080 | 0.040 | sync (I) | `[1.080, 1.120)` |
| F2 | 1.040 | 1.000 | 0.040 | non-sync | `[1.000, 1.040)` |
| F3 | 1.080 | 1.040 | 0.040 | non-sync | `[1.040, 1.080)` |
| F4 | 1.120 | 1.120 | 0.040 | sync (next-GOP I) | `[1.120, 1.160)` |

The MSE algorithm distinguishes only sync vs non-sync frames. Whether F2 and F3 are P- or B-frames in the codec sense (referencing only F1, or F1 and some future frame) is immaterial here; any non-sync frame with this decode-timestamp / presentation-timestamp shape exhibits the issue.

Decode order is F1 → F2 → F3 → F4. Presentation order is F2 → F3 → F1 → F4. Every frame has its own 40 ms slot; intervals do not overlap, so the overlap check below unambiguously returns a single coded frame.

After Segment 1: `last decode timestamp` = 1.120, `highest end timestamp` = 1.160, `last frame duration` = 0.040.
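
The state values above can be reproduced by walking Segment 1 in decode order. This is a minimal sketch, not spec text: it assumes integer microseconds in place of the spec's seconds-valued doubles (so arithmetic is exact), it ignores the discontinuity checks (none fire within Segment 1), and `Frame` and `state_after_append` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple

class Frame(NamedTuple):
    name: str
    dts: int    # decode timestamp, microseconds
    pts: int    # presentation timestamp, microseconds
    dur: int    # frame duration, microseconds
    sync: bool  # True for a random access point

SEGMENT_1 = [
    Frame("F1", 1_000_000, 1_080_000, 40_000, True),
    Frame("F2", 1_040_000, 1_000_000, 40_000, False),
    Frame("F3", 1_080_000, 1_040_000, 40_000, False),
    Frame("F4", 1_120_000, 1_120_000, 40_000, True),
]

def state_after_append(frames):
    """Update the per-track-buffer variables as each frame is added (decode order)."""
    last_dts = last_dur = highest_end = None
    for f in frames:
        last_dts = f.dts
        last_dur = f.dur
        frame_end = f.pts + f.dur
        if highest_end is None or frame_end > highest_end:
            highest_end = frame_end
    return last_dts, last_dur, highest_end

print(state_after_append(SEGMENT_1))  # (1120000, 40000, 1160000)
```

In seconds, the tuple reads as `last decode timestamp` = 1.120, `last frame duration` = 0.040, `highest end timestamp` = 1.160.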

**Segment 2** — the incoming coded frame:

| Frame | Decode Timestamp | Presentation Timestamp | Duration | Type |
|:-----:|-----------------:|-----------------------:|---------:|:----:|
| incoming | 0.980 | 1.040 | 0.020 | sync |

The incoming frame's decode timestamp (0.980) < Segment 1's `last decode timestamp` (1.120), so the DTS-discontinuity step fires and resets `last decode timestamp`, `last frame duration`, and `highest end timestamp`. It also sets `need random access point flag` to true on every track buffer. The incoming frame is reprocessed from Loop Top; the need-RAP gate then forces the next processed frame to be a random access point, which the incoming frame is.
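
The discontinuity test that fires here can be sketched as a small predicate. This is an illustration under assumptions: timestamps are integer microseconds rather than seconds-valued doubles, and `dts_discontinuity` is a hypothetical helper name. The second clause (a forward jump of more than twice `last frame duration`) comes from the spec's text for the same step and is not exercised by this issue's example.

```python
from typing import Optional

def dts_discontinuity(dts: int, last_dts: Optional[int], last_dur: Optional[int]) -> bool:
    """Return True when the DTS-discontinuity step would fire for this frame."""
    if last_dts is None:
        # Nothing to compare against: the step is skipped.
        return False
    went_backward = dts < last_dts
    jumped_forward = (dts - last_dts) > 2 * last_dur
    return went_backward or jumped_forward

print(dts_discontinuity(980_000, 1_120_000, 40_000))    # True: 0.980 < 1.120
print(dts_discontinuity(1_160_000, 1_120_000, 40_000))  # False: next frame in cadence
print(dts_discontinuity(1_120_000, 1_120_000, 40_000))  # False: equality does not reset
```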

### Current spec trace

1. `need random access point flag` is set; the incoming frame is a random access point → clear the flag.
2. **Overlap check** (`last decode timestamp` is unset after the reset): the presentation interval containing 1.040 is F3's `[1.040, 1.080)` (no other interval contains 1.040, so the match is unambiguous). Video branch: `remove window = 1.040 + 1µs = 1.040001`; `1.040 < 1.040001` → remove F3.
3. **Remove existing coded frames** (first branch, `highest end timestamp` not set): remove frames with presentation timestamp in `[1.040, 1.060)`. Only F3 qualifies; already removed.
4. **Remove decoding dependencies**: frames removed so far = `{F3}`. Next random access point after F3 in decode order = F4. Decode-order range strictly between `{F3}` and F4 is empty; no new removals.
5. **Add the coded frame**: the incoming frame is added.

Resulting buffer (decode order):

```
incoming (decode=0.980, pres=1.040, dur=0.020, sync)
F1       (decode=1.000, pres=1.080, dur=0.040, sync)
F2       (decode=1.040, pres=1.000, dur=0.040, non-sync)  ← ORPHAN
F4       (decode=1.120, pres=1.120, dur=0.040, sync)
```

F2 has presentation timestamp 1.000, below every remaining random access point's presentation timestamp (incoming at 1.040, F1 at 1.080, F4 at 1.120). On seek to `t = 1.000` there is no random access point at or before that time. F2 is a presentation-time orphan.
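
The trace above can be replayed with a small simulation. It is a sketch under stated assumptions, not an implementation of the spec: timestamps are integer microseconds, only the video branch of the overlap check is modelled, only the post-discontinuity case (`last decode timestamp` and `highest end timestamp` unset) is handled, and `Frame`, `sweep_dependents`, `append_current`, and `orphans` are hypothetical names.

```python
from typing import NamedTuple

class Frame(NamedTuple):
    name: str
    dts: int    # decode timestamp, microseconds
    pts: int    # presentation timestamp, microseconds
    dur: int    # frame duration, microseconds
    sync: bool  # True for a random access point

def sweep_dependents(buf, removed):
    """Drop frames that follow a removed frame in decode order, up to the next sync frame."""
    doomed = set()
    for r in removed:
        for f in sorted(buf, key=lambda f: f.dts):
            if f.dts <= r.dts:
                continue
            if f.sync:
                break
            doomed.add(f)
    return [f for f in buf if f not in doomed]

def append_current(buf, inc):
    """Current cleanup steps for one incoming frame, post-discontinuity, video only."""
    removed = []
    # Overlap check: remove the frame whose interval contains inc.pts
    # when inc.pts is below that frame's start + 1 microsecond.
    for f in buf:
        if f.pts <= inc.pts < f.pts + f.dur and inc.pts < f.pts + 1:
            removed.append(f)
    # Remove existing coded frames, first branch: pts in [inc.pts, inc.pts + inc.dur).
    for f in buf:
        if inc.pts <= f.pts < inc.pts + inc.dur and f not in removed:
            removed.append(f)
    buf = [f for f in buf if f not in removed]
    buf = sweep_dependents(buf, removed)
    return sorted(buf + [inc], key=lambda f: f.dts)

def orphans(buf):
    """Non-sync frames with no sync frame at or before their presentation timestamp."""
    first_sync_pts = min(f.pts for f in buf if f.sync)
    return [f.name for f in buf if not f.sync and f.pts < first_sync_pts]

segment_1 = [
    Frame("F1", 1_000_000, 1_080_000, 40_000, True),
    Frame("F2", 1_040_000, 1_000_000, 40_000, False),
    Frame("F3", 1_080_000, 1_040_000, 40_000, False),
    Frame("F4", 1_120_000, 1_120_000, 40_000, True),
]
incoming = Frame("incoming", 980_000, 1_040_000, 20_000, True)

result = append_current(segment_1, incoming)
print([f.name for f in result])  # ['incoming', 'F1', 'F2', 'F4']
print(orphans(result))           # ['F2']
```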

### Amended spec trace

The new step runs before "Remove existing coded frames":

> Remove all coded frames whose `decode timestamp ≥ 0.980` and whose `presentation timestamp < 1.040`.

Applied to each existing frame:

| Frame | decode ≥ 0.980? | presentation < 1.040? | Action |
|:-----:|:---------------:|:---------------------:|:------:|
| F1 | yes | no (1.080) | preserved |
| F2 | yes | yes (1.000) | **removed** |
| F4 | yes | no (1.120) | preserved |

(F3 was already removed by the overlap step.)

Dependency sweep: next random access point after the removed frames in decode order is F4. Decode-order range strictly between `{F2, F3}` and F4 is empty; no new removals.

Resulting buffer (decode order):

```
incoming (decode=0.980, pres=1.040, dur=0.020, sync)
F1       (decode=1.000, pres=1.080, dur=0.040, sync)
F4       (decode=1.120, pres=1.120, dur=0.040, sync)
```

Every remaining frame is a random access point. No orphans. Seek to any time from `t = 1.040` onward resolves to a random access point at or before it.
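
The amended trace can be checked the same way. The sketch below starts from the buffer as the overlap check leaves it (F3 already removed) and applies only the proposed step. It assumes integer microseconds, and `Frame` and `remove_decode_shadowed` are hypothetical names. The dependency sweep is omitted because it removes nothing in this example.

```python
from typing import NamedTuple

class Frame(NamedTuple):
    name: str
    dts: int    # decode timestamp, microseconds
    pts: int    # presentation timestamp, microseconds
    dur: int    # frame duration, microseconds
    sync: bool  # True for a random access point

def remove_decode_shadowed(buf, inc):
    """Proposed step: split buf into (kept, removed) using the two-part predicate."""
    removed = [f for f in buf if f.dts >= inc.dts and f.pts < inc.pts]
    kept = [f for f in buf if f not in removed]
    return kept, removed

# Buffer state after the overlap check has removed F3.
after_overlap = [
    Frame("F1", 1_000_000, 1_080_000, 40_000, True),
    Frame("F2", 1_040_000, 1_000_000, 40_000, False),
    Frame("F4", 1_120_000, 1_120_000, 40_000, True),
]
incoming = Frame("incoming", 980_000, 1_040_000, 20_000, True)

kept, removed = remove_decode_shadowed(after_overlap, incoming)
final = sorted(kept + [incoming], key=lambda f: f.dts)

print([f.name for f in removed])   # ['F2']
print([f.name for f in final])     # ['incoming', 'F1', 'F4']
print(all(f.sync for f in final))  # True
```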

## Proposed amendment

Insert a new step into the per-coded-frame loop, immediately after the existing "If `last decode timestamp` for `track buffer` is unset and `presentation timestamp` falls within the presentation interval of a coded frame in track buffer" step, and before the "Remove existing coded frames in track buffer" step:

```html
<li>Remove all coded frames from |track buffer| whose decode timestamp is greater than or equal to |decode timestamp| and whose [=presentation timestamp=] is less than |presentation timestamp|.
</li>
```

Update the "Remove all possible decoding dependencies …" step, which immediately follows "Remove existing coded frames in track buffer", to extend its sweep to include frames removed by this new step:

```diff
-  Remove all possible decoding dependencies on the coded frames removed in the previous two steps by removing all coded frames from |track buffer| between those frames removed in the previous two steps and the next random access point after those removed frames.
+  Remove all possible decoding dependencies on the coded frames removed in the previous three steps by removing all coded frames from |track buffer| between those frames removed in the previous three steps and the next random access point after those removed frames.
```

### Why this predicate

The new step removes coded frames that satisfy **both** conditions:

1. Decode timestamp is greater than or equal to the incoming coded frame's decode timestamp. Frames with decode timestamp less than the incoming's were decoded before the incoming coded frame's decoder state takes over and remain legitimate prior content.
2. Presentation timestamp is less than the incoming coded frame's presentation timestamp. Frames with presentation timestamp greater than or equal to the incoming's are legitimate next-GOP content (for example, F1 in the worked example: its decode timestamp is after the incoming's, but its presentation timestamp is after the incoming's too, so it is preserved as valid content).

The conjunction targets exactly the orphaned decode-shadowed frames: decoded after the incoming [=random access point=] but presented before it, with no remaining [=random access point=] at or before their presentation timestamps once the incoming one takes over.

### Scope and side effects

The new step is a no-op during continuous appending (no discontinuity). Coded frames in a single coded frame group have monotonically non-decreasing decode timestamps (the DTS-discontinuity step fires on `decode timestamp < last decode timestamp`, permitting equality without reset), so by the time the incoming frame is processed, existing frames in the track buffer generally have decode timestamp less than the incoming's.
The step therefore only does real work after the DTS-discontinuity step has fired, or in the edge case of a decode-timestamp tie at a coded-frame-group boundary — exactly the scenarios this gap affects.
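
The no-op claim can be checked against Segment 1 from the worked example: appending its frames one by one into an initially empty buffer never matches the predicate, while the post-discontinuity incoming frame does. This sketch assumes integer microseconds and an empty starting buffer; `Frame` and `shadowed` are hypothetical names.

```python
from typing import NamedTuple

class Frame(NamedTuple):
    name: str
    dts: int  # decode timestamp, microseconds
    pts: int  # presentation timestamp, microseconds

def shadowed(buf, inc):
    """Names of buffered frames the proposed step would remove for `inc`."""
    return [f.name for f in buf if f.dts >= inc.dts and f.pts < inc.pts]

segment_1 = [
    Frame("F1", 1_000_000, 1_080_000),
    Frame("F2", 1_040_000, 1_000_000),
    Frame("F3", 1_080_000, 1_040_000),
    Frame("F4", 1_120_000, 1_120_000),
]

buf = []
hits = []
for inc in segment_1:  # continuous append: decode timestamps only increase
    hits += shadowed(buf, inc)
    buf.append(inc)
print(hits)  # []

# After the discontinuity, the same predicate does real work.
after_discontinuity = shadowed(buf, Frame("incoming", 980_000, 1_040_000))
print(after_discontinuity)  # ['F2']
```

F3 is not matched here because its presentation timestamp equals the incoming frame's; in the worked example it is the overlap check that removes it.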

The step does not need to gate on "the incoming coded frame is a [=random access point=]". By the time the step is reached after a DTS-discontinuity reset, the `need random access point flag` has already forced the incoming frame to be a random access point, or the frame has been dropped. Gating would be redundant.

The amendment does not change `last decode timestamp` semantics. After "Add the coded frame", `last decode timestamp` is set to the incoming frame's decode timestamp (0.980 in the worked example), which is less than a decode timestamp already in the buffer (F1's 1.000). This asymmetry already exists in the current algorithm — `last decode timestamp` tracks the most recently appended frame's decode timestamp, not the maximum across the track buffer — and the amendment neither creates nor worsens it.


Please view or discuss this issue at https://github.com/w3c/media-source/issues/374 using your GitHub account


-- 
Sent via github-notify-ml as configured in https://github.com/w3c/github-notify-ml-config
Received on Wednesday, 13 May 2026 11:54:02 UTC