First draft of a brief explainer from Nigel Megitt on 2018-07-27 (public-audio-description@w3.org from July 2018)

From: Nigel Megitt <nigel.megitt@bbc.co.uk>
Date: Fri, 27 Jul 2018 11:21:20 +0000
To: "public-audio-description@w3.org" <public-audio-description@w3.org>
Message-ID: <D780BEB4.62AE9%nigel.megitt@bbc.co.uk>
Hi all,

I thought it would be worth writing a quick explainer of what I have in mind and am working towards with AD in TTML2. Thoughts very welcome! If this is worthwhile, we should create a GitHub repository for the group and pop it in there so we can edit it and track those edits. Let me know what you think!

Kind regards,

Nigel


Audio Description Explainer
Introduction

The goal is to be able to deliver audio description script, pre-recorded audio and mixing data in a single file with an open standard format, that can also use text to speech for potential client-side audio rendering later.

There is work in TTML2<https://w3c.github.io/ttml2/index.html> to define better constructs for representing the main requirements - continuous animation, pan and gain for mixing, speech rate and pitch for text to speech. This is implementable in browsers using Web Audio<https://www.w3.org/TR/webaudio/> and Web Speech<https://w3c.github.io/speech-api/speechapi.html> respectively, though the latter needs some work. There is also an interested bunch of people in the W3C Audio Description Community Group<https://www.w3.org/community/audio-description/> who would support creation of an open standard format meeting the requirements<https://github.com/w3c/ttml2/wiki/Audio-Description-Requirements> agreed for TTML2.

In the end, we should be able to deliver an AD profile of a TTML2 file to clients which provide real time mixing of the AD, perhaps with some user customisation (like changing the relative volumes of the main programme audio and the AD audio), or even presentation of the AD script text on a completely different device, like a braille display. Those clients might be hosted server side to create a “broadcaster mix” or genuinely on the client to create a “receiver mix”. Obviously, presentation on a braille display and user customisation both require the client side player.

File Format

The file format should look something like:

<?xml version="1.0" encoding="UTF-8"?>

<tt xmlns="http://www.w3.org/ns/ttml" xmlns:ttd="http://www.w3.org/ns/ttml#datatype" xmlns:ttm="http://www.w3.org/ns/ttml#metadata" xmlns:ttp="http://www.w3.org/ns/ttml#parameter" xmlns:tts="http://www.w3.org/ns/ttml#styling" xmlns:tta="http://www.w3.org/ns/ttml#audio" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="en">

  <body>
    <div>
      <audio src=";track=1" tta:pan="-1"/>
      <audio src=";track=2" tta:pan="1"/>

      <p xml:id="ad21b" begin="5.48s" end="19.44s" tta:gain="0.25">
        <animate begin="0.0s" end="0.12s" tta:gain="1;0.39"/>
        <animate begin="13.84s" end="13.96s" tta:gain="0.39;1"/>
        <span begin="0.12s" end="13.84s" tta:gain="0.25">
          <audio src="DRAD182Y01.wav" clipBegin="11.6s" clipEnd="24.32s"/>
          BBC Eastenders written by Colin Wyatt starring June Brown as Dot, John Altman as Nick, Declan Bennett as Charlie and Samantha Womack as Ronnie.</span>
      </p>

      <p xml:id="ad31b" begin="30.56s" end="32.84s" tta:gain="0.25">
        <animate begin="0.0s" end="0.12s" tta:gain="1;0.39"/>
        <animate begin="2.16s" end="2.28s" tta:gain="0.39;1"/>
        <span begin="0.12s" end="2.16s" tta:gain="0.25">
          <audio src="DRAD182Y01.wav" clipBegin="35.68s" clipEnd="37.72s"/>
          Nick takes a drag of his cigarette.</span>
      </p>

      <p xml:id="ad41b" begin="49.32s" end="51.16s">
        <animate begin="0.0s" end="0.12s" tta:gain="1;0.39"/>
        <animate begin="1.72s" end="1.84s" tta:gain="0.39;1"/>
        <span begin="0.12s" end="1.72s">
          <audio src="DRAD182Y01.wav" clipBegin="54.44s" clipEnd="56.04s"/>
          Nick gets up.</span>
      </p>

      <p xml:id="ad51b" begin="54.92s" end="57.08s">
        <animate begin="0.0s" end="0.12s" tta:gain="1;0.39"/>
        <animate begin="2.04s" end="2.16s" tta:gain="0.39;1"/>
        <span begin="0.12s" end="2.04s">
          <audio src="DRAD182Y01.wav" clipBegin="60.04s" clipEnd="61.96s"/>
          He grabs a knife.</span>
      </p>

      <p xml:id="ad61b" begin="62.24s" end="71.52s">
        <ttm:desc ttm:role="x-shotDescription">SHOT CHANGE: 62.36s Nick Cotton centre screen facing right.</ttm:desc>
        <animate begin="0.0s" end="0.12s" tta:gain="1;0.39"/>
        <animate begin="9.16s" end="9.28s" tta:gain="0.39;1"/>
        <span begin="0.12s" end="9.16s">
          <audio src=";track=2" tta:gain="0.25"/>
          Ronnie looks worried but he grabs a swiss roll from a carrier bag and roughly cuts off two slices offering her one on the end of a knife.</span>
      </p>

      <p xml:id="ad71b" begin="79.2s" end="82.12s">
        <ttm:desc ttm:role="x-pronunciationNote">PRON: Koosh.</ttm:desc>
        <animate begin="0.0s" end="0.12s" tta:gain="1;0.39"/>
        <animate begin="2.8s" end="2.92s" tta:gain="0.39;1"/>
        <span begin="0.12s" end="2.8s">
          <audio src="DRAD182Y01.wav" clipBegin="84.32s" clipEnd="87.0s"/>
          Sonia leaves the Vic followed by Kush.</span>
      </p>

      <p xml:id="ad91b" begin="115.16s" end="117.12s">
        <animate begin="0.0s" end="0.12s" tta:gain="1;0.39"/>
        <animate begin="1.84s" end="1.96s" tta:gain="0.39;1"/>
        <span begin="0.12s" end="1.84s">
          <audio src="DRAD182Y01.wav" clipBegin="120.28s" clipEnd="122.0s"/>
          At Dot's...</span>
      </p>

    </div>
  </body>
</tt>



Let's look at this in a bit more detail:

We have a div element that wraps all the other content. Crucially it includes two audio element children, which do a few things:

  *   They make that parent element an Audio generating element<https://w3c.github.io/ttml2/index.html#terms-audio-generating-element>, which means that the player code needs to create a Web Audio graph node for it.
  *   The first tells the player to mix in ;track=1 (whatever that means) and pan it all the way to the left, i.e. tta:pan="-1".
  *   The second tells the player to mix in ;track=2 (whatever that means) and pan it all the way to the right, i.e. tta:pan="1".

We need a convention to identify "tracks that are provided to us from somewhere else", and in this case we've defined ;track=n to do that.
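
As a rough illustration of how a player might recognise that convention, here is a minimal sketch. The helper name parseTrackRef and the strict ";track=n" pattern are assumptions made for this example; the explainer itself only says that ;track=n identifies a track provided from somewhere else.

```typescript
// Hypothetical helper: recognise the ";track=n" convention on an audio src.
// The exact pattern is an assumption; nothing more is defined in this draft.
function parseTrackRef(src: string): number | null {
  const match = /^;track=(\d+)$/.exec(src);
  // Anything that does not match is treated as an ordinary resource reference.
  return match ? Number(match[1]) : null;
}

const programmeLeft = parseTrackRef(";track=1");     // 1
const adRecording = parseTrackRef("DRAD182Y01.wav"); // null
```

A player could use a non-null result to pick one of the tracks it was handed, and fall back to fetching the src otherwise.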

Then there are a bunch of child p elements that each have a begin and end time. They each represent a snippet of audio description and the time during which the programme audio is faded down, the description is played and the programme audio is faded back up. The text of the audio description is contained in a child span element, which itself has begin and end times. The span's begin and end times are relative to the parent p element's begin time.
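
To make the relative timing concrete, here is a small sketch that resolves a span's times against its parent p. The Interval type and resolveInterval function are names invented for this example, and it assumes the div and body start at time zero, as they do in the file above where neither has a begin attribute.

```typescript
interface Interval {
  begin: number; // seconds
  end: number;   // seconds
}

// Convert a child's interval, given relative to its parent p, into
// document time by adding the parent's begin time.
function resolveInterval(parentBegin: number, child: Interval): Interval {
  return { begin: parentBegin + child.begin, end: parentBegin + child.end };
}

// Values taken from the paragraph with xml:id="ad21b" in the example file.
const ad21bSpan = resolveInterval(5.48, { begin: 0.12, end: 13.84 });
// ad21bSpan is roughly { begin: 5.6, end: 19.32 }, up to floating-point rounding.
```

The p itself runs from 5.48s to 19.44s, so the 0.12s left over at each end of the span is where the two fades sit.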

You might notice that there's some metadata there too, in the ttm:desc elements (a shot description and a pronunciation note), which might be helpful during the authoring process, for example.

We need a few things to happen for each snippet of audio description:

  1.  Fade down the programme audio level.
  2.  Play the audio description audio chunk, stereo panned to the right place.
  3.  Fade the programme audio back up.

The fade down and fade up are both achieved by placing animate elements as children of the p element. Each one smoothly changes ("continuously animates") the tta:gain value through the values given in a semicolon-separated list. The begin and end times of the animation are specified on the animate element and are relative to the parent p element's begin time. The audio that they modify is the audio that is available to that element, i.e. the programme audio that comes down to the p from the parent div (this is where the audio specified on the div ends up).
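
The arithmetic of such a fade can be sketched as below. This assumes linear interpolation with the listed values spaced evenly across the animation's duration, which is an assumption of this example rather than something stated here. The names GainAnimation, parseGainList and gainAt are hypothetical, and the sketch deliberately says nothing about the gain outside the animation's interval.

```typescript
interface GainAnimation {
  begin: number;    // seconds, relative to the parent p
  end: number;      // seconds, relative to the parent p
  values: number[]; // parsed from e.g. tta:gain="1;0.39"
}

// Split a semicolon-separated tta:gain list into numbers.
function parseGainList(attr: string): number[] {
  return attr.split(";").map((v) => Number(v.trim()));
}

// Interpolated gain at time t (relative to the parent p), or null when t
// lies outside the animation or the animation has fewer than two values.
function gainAt(t: number, anim: GainAnimation): number | null {
  const { begin, end, values } = anim;
  if (values.length < 2 || end <= begin || t < begin || t > end) return null;
  const segments = values.length - 1;
  const position = ((t - begin) / (end - begin)) * segments; // 0..segments
  const index = Math.min(Math.floor(position), segments - 1);
  const from = values[index];
  const to = values[index + 1];
  return from + (to - from) * (position - index);
}

// The fade down used by every p in the example file.
const fadeDown: GainAnimation = { begin: 0.0, end: 0.12, values: parseGainList("1;0.39") };
const halfway = gainAt(0.06, fadeDown); // about 0.695
```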

Playing the audio description is done by adding a new audio child to the span. The playback begins in the presentation at the span's begin time, and the clipBegin and clipEnd mark the in and out points of the referenced audio resource to play, which is specified by the src attribute. If we wanted to specify a left/right pan value, we could do that by setting a tta:pan attribute on the audio element itself. Similarly we could vary the level of the audio by setting a tta:gain value.
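
As a sketch of how those attributes might translate into playback parameters, the helper below derives a start time, an offset into the resource and a duration. The ClipPlayback type and clipPlayback function are hypothetical names for this example. The mapping onto the three arguments of a Web Audio AudioBufferSourceNode's start() method is one possible approach, not something this draft prescribes.

```typescript
interface ClipPlayback {
  when: number;     // document time at which playback starts, in seconds
  offset: number;   // position in the audio resource to start from
  duration: number; // how much of the resource to play
}

// spanBeginAbsolute is the span's begin time already resolved against
// its parent p, i.e. the p's begin plus the span's begin.
function clipPlayback(
  spanBeginAbsolute: number,
  clipBegin: number,
  clipEnd: number,
): ClipPlayback {
  return { when: spanBeginAbsolute, offset: clipBegin, duration: clipEnd - clipBegin };
}

// xml:id="ad31b": the p begins at 30.56s and the span begins 0.12s later.
const ad31bClip = clipPlayback(30.56 + 0.12, 35.68, 37.72);
// With a decoded buffer, a player could then call something like:
//   source.start(startTime + ad31bClip.when, ad31bClip.offset, ad31bClip.duration);
// where startTime (hypothetical) is the AudioContext time of media time zero.
```

For this paragraph the clip duration comes out at about 2.04s, the same length as the span (0.12s to 2.16s).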

This structure is implemented in the mixing code by constructing a Web Audio Graph, where the outputs of all the spans are, in the end, mixed together.
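
One way the animate fades might be mapped onto Web Audio automation is sketched below. It is written against a small structural interface, GainParamLike, so that it can be exercised without a browser; a real GainNode's gain parameter provides both methods. The function name scheduleFade and the pBeginContextTime parameter (the AudioContext time at which the parent p begins) are assumptions made for this example.

```typescript
// Structural subset of the Web Audio AudioParam interface.
interface GainParamLike {
  setValueAtTime(value: number, startTime: number): unknown;
  linearRampToValueAtTime(value: number, endTime: number): unknown;
}

// Schedule one two-value fade, e.g. tta:gain="1;0.39" from begin to end.
// begin and end are relative to the parent p, as in the file format.
function scheduleFade(
  param: GainParamLike,
  pBeginContextTime: number, // hypothetical: context time of the p's begin
  begin: number,
  end: number,
  from: number,
  to: number,
): void {
  // Pin the starting value, then ramp linearly to the target value.
  param.setValueAtTime(from, pBeginContextTime + begin);
  param.linearRampToValueAtTime(to, pBeginContextTime + end);
}

// Possible use for xml:id="ad21b", with programmeGain a GainNode and
// t0 the context time of media time zero (both hypothetical):
//   scheduleFade(programmeGain.gain, t0 + 5.48, 0.0, 0.12, 1, 0.39);
//   scheduleFade(programmeGain.gain, t0 + 5.48, 13.84, 13.96, 0.39, 1);
```

In Web Audio the parameter holds its value after a ramp finishes until the next scheduled event, so the programme audio would stay at 0.39 between the two fades.
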
Received on Friday, 27 July 2018 11:21:47 UTC