RE: TT WG Primary Goal - Authoring Exchange Format from David Kirby on 2003-02-24 (public-tt@w3.org from February 2003)

From: David Kirby <david.kirby@rd.bbc.co.uk>
Date: Mon, 24 Feb 2003 16:52:24 +0000
To: public-tt@w3.org
Message-Id: <3.0.5.32.20030224165224.00f156f0@pop3>

If the TT work is now to focus on an authoring format rather than
streaming, then, based on our current usage of xml for subtitle authoring
(in the BBC), I have several requirements to suggest.

The markup needs to be capable of carrying timing data and other
information about the original text, down to individual word level.  Where
the text is the scripted dialogue (or similar), these timings should refer
to the audio track and hence should be in seconds (millisecond resolution
is convenient). 

Where we're considering subtitle authoring, we also need to identify who is
speaking as this is needed where a text colour has to be assigned to each
speaker. Just giving the text a colour doesn't work; the individual
speakers need to be identified in the markup so that colour assignment can
be automated (or changed easily) via the speaker names. (The same reasoning
will apply in future for font details too.) This facility is essential for
authoring subtitles. If you don't use text colours or decide to change the
colour assignment, many of the subtitles created from the initial text
could change their grouping and layout. Put more generally, the text should
be associated with a speaker and a text-style linked to that speaker.
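
A minimal sketch of this association, in Python rather than in the markup
itself. The table name STYLE_BY_SPEAKER, the helper style_for, the second
speaker and all colour values are hypothetical, chosen only to illustrate
that text carries a speaker name and the colour is looked up separately.

```python
# Hypothetical speaker-to-style table; names and colours are illustrative only.
STYLE_BY_SPEAKER = {
    "Narrator": {"colour": "white"},
    "Speaker B": {"colour": "yellow"},
}

# Fallback used when a speaker has no entry in the table (an assumption).
DEFAULT_STYLE = {"colour": "white"}


def style_for(speaker, styles=STYLE_BY_SPEAKER):
    """Look up the text style for a speaker name, falling back to a default."""
    return styles.get(speaker, DEFAULT_STYLE)


# Changing the colour assignment means editing the table, not the text.
restyled = dict(STYLE_BY_SPEAKER, Narrator={"colour": "cyan"})

print(style_for("Narrator")["colour"])            # white
print(style_for("Narrator", restyled)["colour"])  # cyan
```

Because the words only ever reference "Narrator", swapping the table is
enough to recolour every line that speaker has.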

It is also convenient to have scene changes identified as the assignment of
text colours can be influenced by these.

If subtitles are to be created then the timings of the shot changes in the
video are also needed as these affect the way the original text is grouped
into subtitles.

Once the subtitles are created we have this additional data:
the grouping of the original words in each subtitle
in and out times
subtitle positioning and display data
foreground and background colours.

As the subtitles are created, the original word timings (in seconds) are
used to produce in- and out-times which are in frames, as they relate to
the video and are no longer locked rigidly to the audio track. In our case
we use frame numbers for these timings, with the first I-frame being frame
zero. When we subsequently produce the subtitle file for broadcast, we use
the user's timecode of the first I-frame as an offset to convert our frame
numbers into programme timecode. 

As a short example of what I mean, here are a few lines I've taken from the
xml file for the programme "Walking with Dinosaurs" which starts with the
words "Imagine you could travel back in time, to a time long before man." 

Firstly, our processing (using speech recognition) assigns a time to each
word, with time=0 being the start of the audio file, i.e. no timecodes or
frame numbers at this stage as this part has nothing to do with video. 

Our markup of this section is this: (the tag 'Audio' means spoken content) 
<Script>
<Scene>
<Title>Intro</Title>
<Speaker>
<Name>Narrator</Name>
<Audio id="0" time="48.100" end="48.300" >Imagine</Audio>
<Audio id="1" time="48.310" end="48.420">you</Audio>
<Audio id="2" time="48.480" end="48.600">could</Audio>
<Audio id="3" time="48.850" end="49.200">travel</Audio>
<Audio id="4" time="49.250" end="49.510">back</Audio>
...

Hence the first word is spoken 48.1 seconds from the start of the audio
track and ends at 48.3 seconds. (In an earlier version of our markup we
used durations rather than an end-time but found we were forever converting
to end times.)
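
As a sketch of how such word timings might be read, the Python below parses
the five words shown above. The closing tags are added here as an assumption
so that the fragment parses (the original is elided after the fifth word),
and the function name read_words is hypothetical. Times are kept as Decimal
seconds to preserve the millisecond values exactly.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# The fragment from the message; the elided words are left out and the
# closing tags are added (an assumption) so that it is well-formed.
FRAGMENT = """
<Script>
<Scene>
<Title>Intro</Title>
<Speaker>
<Name>Narrator</Name>
<Audio id="0" time="48.100" end="48.300" >Imagine</Audio>
<Audio id="1" time="48.310" end="48.420">you</Audio>
<Audio id="2" time="48.480" end="48.600">could</Audio>
<Audio id="3" time="48.850" end="49.200">travel</Audio>
<Audio id="4" time="49.250" end="49.510">back</Audio>
</Speaker>
</Scene>
</Script>
"""


def read_words(xml_text):
    """Return (speaker, word, start, end) tuples with times in seconds."""
    root = ET.fromstring(xml_text)
    words = []
    for speaker in root.iter("Speaker"):
        name = speaker.findtext("Name")
        for audio in speaker.findall("Audio"):
            start = Decimal(audio.get("time"))
            end = Decimal(audio.get("end"))
            words.append((name, audio.text, start, end))
    return words


words = read_words(FRAGMENT)
# A duration, if one is wanted, is derived from the stored start and end.
durations = [end - start for _, _, start, end in words]
print(words[0])      # first word with its speaker and times
print(durations[0])  # 0.200
```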

Skipping ahead a little in the process, from this timed text (together
with shot changes, etc.) we create subtitles which are referenced to the
video. Timings are now in frames. (Note that time=0 is not the same instant
as frame=0, as audio and video streams do not start together and we have to
compensate for this offset.)

The subtitle section in the xml file starts:
<Subtitles>
<Sub in="1198" out="1255">
<Line index="0">Imagine you could travel back in time</Line>
</Sub>
<Sub in="1256" out="1320">
<Line index="0">to a time long before man</Line>
</Sub>

The tag <Sub... identifies each subtitle, and 'in' and 'out' are the frame
numbers between which the text is displayed.  The <Line... tag indicates
the line number in the subtitle, with index="0" indicating this is the
first, and in this example only, line of text in the subtitle. (I've
omitted the colour and display position details.)
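
A short Python sketch of reading this section follows. The closing
</Subtitles> tag is added as an assumption, since only the start of the
section is quoted, and read_subtitles is a hypothetical helper name. Lines
are ordered by their index attribute.

```python
import xml.etree.ElementTree as ET

# The two subtitles quoted above, closed off (an assumption) so they parse.
SUBTITLES = """
<Subtitles>
<Sub in="1198" out="1255">
<Line index="0">Imagine you could travel back in time</Line>
</Sub>
<Sub in="1256" out="1320">
<Line index="0">to a time long before man</Line>
</Sub>
</Subtitles>
"""


def read_subtitles(xml_text):
    """Return (in_frame, out_frame, [line texts]) for each Sub element."""
    root = ET.fromstring(xml_text)
    subs = []
    for sub in root.iter("Sub"):
        # Sort lines by their index so multi-line subtitles keep their order.
        lines = sorted(sub.findall("Line"), key=lambda ln: int(ln.get("index")))
        texts = [ln.text for ln in lines]
        subs.append((int(sub.get("in")), int(sub.get("out")), texts))
    return subs


subs = read_subtitles(SUBTITLES)
for in_frame, out_frame, texts in subs:
    print(in_frame, out_frame, texts)
```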

Finally, elsewhere in the file is the line:
<Video file="\\serverName\Videos\walkingwithdinosaurs.mpg"
firstframe="899620"/>

which tells us the video file this was authored with; the value of
firstframe is the user's timecode for the first I-frame, expressed as a
frame count. So add 899620 to the first in-time of 1198 and we get an
in-time of 900818 frames, which is 10:00:32:18 at 25fps.
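
The arithmetic above can be checked with a small Python sketch. The function
name frames_to_timecode is hypothetical, and the code assumes a simple
non-drop-frame count at an integer frame rate, which fits the 25fps case
described here.

```python
def frames_to_timecode(frame, fps=25):
    """Format a non-negative frame count as HH:MM:SS:FF (non-drop-frame)."""
    if frame < 0:
        raise ValueError("frame count must be non-negative")
    seconds, ff = divmod(frame, fps)
    minutes, ss = divmod(seconds, 60)
    hh, mm = divmod(minutes, 60)
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}"


FIRST_FRAME = 899620  # firstframe attribute from the <Video .../> element
in_frame = 1198       # "in" attribute of the first <Sub>

print(frames_to_timecode(FIRST_FRAME + in_frame))  # 10:00:32:18
print(frames_to_timecode(FIRST_FRAME))             # 09:59:44:20
```

The second line shows the timecode of frame zero itself, i.e. the first
I-frame of the video file.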

It's important to us to preserve the original ("audio-based") timings of
the words and these are not changed in any way as the subtitles are created
from them. This means that we can produce different styles of subtitles
(e.g. two-liners, three-liners or Line 21 rather than UK teletext) by
re-running only the subtitle formatting stage, as necessary.

These are some of our key requirements for a TT markup for authoring. In
practice we need quite a few more details too but as this is getting a
little too long, I'll leave those for another time.

Regards,
David.
--
David Kirby
Project Manager
BBC Research and Development      
Kingswood Warren                  Tel: +44 1737 839623
Tadworth, Surrey.                 Fax: +44 1737 839665
KT20 6NP, UK.
Received on Monday, 24 February 2003 12:05:26 UTC