WebCodecs - Incoming liaison letter from 3GPP SA4

Hello Media Working Group participants,

W3C received a liaison letter for the Media Working Group from the 
3GPP TSG SA WG4 (SA4) with questions on the WebCodecs API. See the 
attached file for the full letter. I'm reproducing the questions in 
text format below to ease review.

[[
The IVAS codec is designed to handle multiple spatial input formats 
(multi-channel / objects / Ambisonics / parametric (Metadata Assisted 
Spatial Audio, MASA)), and the IVAS decoder integrates a renderer 
capable of rendering to various output formats such as binaural and 
loudspeaker layouts. Time-aligned metadata serves a critical function 
in both the encoder and decoder. In the IVAS encoder, time-aligned 
metadata is needed for encoding object-based audio, MASA, and combined 
formats. Similarly, the IVAS decoder can ingest head-orientation 
sensor metadata to render a fully immersive head-tracked binaural 
audio experience. The decoder may also need to manage Processing 
Information metadata, when available in an RTP payload, to configure 
the rendering.

AudioDecoder Interface
-----
- Currently the configure() API defines AudioDecoderConfig to allow 
only the description field, a sequence of codec-specific bytes, which 
commonly maps to the initial one-time setup of various codecs (e.g. 
Audio Specific Config in AAC, header/codebook data in Vorbis, etc.). 
However, it provides no mechanism for run-time controls on the decoder 
(e.g. an output format to configure the decoded output). The 
AudioEncoder interface allows codec-specific extensions to 
AudioEncoderConfig as a dictionary of configurations, while 
AudioDecoderConfig does not.
Question: What is the correct way to implement run-time controls such 
as the output format (e.g. BINAURAL, Stereo (2.0), 5.1.2, 7.1.4) in 
the decoder?
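
To make the asymmetry concrete, the following sketch contrasts the two 
configuration dictionaries. The 'opus' extension on AudioEncoderConfig 
is registered in the WebCodecs codec registry; the 'ivas' codec string 
and the commented-out 'ivas' extension dictionary are hypothetical, 
illustrating the kind of run-time control the question asks about.

```javascript
// AudioEncoderConfig admits registered codec-specific dictionaries,
// e.g. the Opus registration's frameDuration (in microseconds):
const encoderConfig = {
  codec: 'opus',
  sampleRate: 48000,
  numberOfChannels: 2,
  opus: { frameDuration: 20000 },  // registered extension
};

// AudioDecoderConfig only carries one-time setup bytes in
// `description`; a run-time control such as an output layout has no
// defined home in the current spec:
const decoderConfig = {
  codec: 'ivas',  // hypothetical codec string, not registered
  sampleRate: 48000,
  numberOfChannels: 2,
  description: new Uint8Array([]),  // codec-specific setup bytes
  // ivas: { outputFormat: 'BINAURAL' },  // hypothetical extension
};

if (typeof AudioDecoder !== 'undefined') {
  // In a browser, configuration would proceed as usual:
  // const decoder = new AudioDecoder({ output, error });
  // decoder.configure(decoderConfig);
}
```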

- Time-varying metadata input to the decoder, whether from device 
sensors (e.g. user head orientation) or from out-of-band signalling 
(e.g. RTP / media container), may be needed for proper integration of 
IVAS’s immersive features; however, there is currently no direct way 
to provide it to the decode() API.
Questions: How can time-aligned codec-specific metadata be injected 
into the decode() call without adding multiple configure() calls per 
frame? If multiple configure() calls are used per frame, would the 
configure() and decode() calls be processed synchronously?
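
The per-frame configure() workaround the questions allude to could 
look like the sketch below: each frame's metadata is serialized into 
the description bytes and a configure() is queued before each 
decode(). The WebCodecs spec queues control messages in order, but the 
per-frame cost and semantics are exactly what is being questioned. 
packMetadata and the 'ivas' codec string are hypothetical.

```javascript
// Hypothetical serialization of head-orientation sensor data (e.g. a
// quaternion) into codec-specific setup bytes.
function packMetadata(headOrientation) {
  return new Float32Array(headOrientation).buffer;
}

// Re-issue configure() before each decode() so the metadata rides in
// `description`. configure() and decode() are queued and processed in
// order, so the pairing is preserved, at the cost of a reconfigure
// per frame.
function decodeWithMetadata(decoder, baseConfig, chunk, headOrientation) {
  decoder.configure({
    ...baseConfig,
    description: packMetadata(headOrientation),
  });
  decoder.decode(chunk);
}
```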

- A decoder might need to produce additional time-aligned 
codec-specific metadata (e.g. object metadata) as output when external 
rendering is desired. Currently the AudioDataOutputCallback allows 
only an AudioData interface, and there is no way for the decode() API 
to produce any additional metadata output. This contrasts with the 
EncodedAudioChunkOutputCallback, which provides an optional 
EncodedAudioChunkMetadata field.
Question: How can the AudioDecoder interface provide both time-aligned 
metadata and PCM audio output?
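
For metadata that originates outside the codec, one conceivable 
application-level pattern is to correlate decoder output with sidecar 
metadata by timestamp, since AudioData carries a timestamp and the 
output callback yields nothing else. Note this does not address 
metadata produced *by* the decoder, which is the gap the question 
identifies. All names below are illustrative.

```javascript
// Sidecar store: timestamp (microseconds) -> metadata known to the
// application (e.g. parsed from an RTP payload).
const sidecar = new Map();

function rememberMetadata(timestamp, metadata) {
  sidecar.set(timestamp, metadata);
}

// Intended to run inside the AudioDataOutputCallback: pair the PCM
// output with any sidecar metadata recorded for the same timestamp.
function pairOutput(audioData) {
  const metadata = sidecar.get(audioData.timestamp);  // may be undefined
  sidecar.delete(audioData.timestamp);
  return { audioData, metadata };
}
```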

AudioEncoder Interface
-----
- Unlike the decoder, the encoder configure() API allows 
codec-specific extensions to AudioEncoderConfig; however, for certain 
immersive formats such as object-based audio or MASA, the encoder 
requires additional time-varying metadata that is time-aligned with 
the input audio.
Question: How can such metadata input be injected into the 
AudioEncoder in a synchronous manner to ensure time-aligned 
application within the encoder?
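
The timing constraint can be sketched as follows: encode() takes only 
an AudioData, so per-frame object/MASA metadata has no parameter to 
ride on, and an application can at best keep its own queue aligned to 
AudioData timestamps. The wrapper below only demonstrates that 
alignment at the application layer; how the metadata would actually 
reach the codec is the open question. All names are illustrative.

```javascript
// Wrap an AudioEncoder so that metadata pushed per timestamp is
// matched against the AudioData fed to encode().
function makeAlignedFeeder(encoder) {
  const pending = [];  // { timestamp, metadata } awaiting their audio
  return {
    pushMetadata(timestamp, metadata) {
      pending.push({ timestamp, metadata });
    },
    encode(audioData) {
      // Find metadata whose timestamp matches this frame; with no
      // metadata parameter on encode(), it cannot be handed to the
      // codec itself and is only returned to the caller here.
      const i = pending.findIndex((p) => p.timestamp === audioData.timestamp);
      const metadata = i >= 0 ? pending.splice(i, 1)[0].metadata : undefined;
      encoder.encode(audioData);
      return metadata;
    },
  };
}
```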
]]

Received on Tuesday, 6 January 2026 15:59:58 UTC