- From: David Kirby <david.kirby@rd.bbc.co.uk>
- Date: Mon, 24 Feb 2003 16:52:24 +0000
- To: public-tt@w3.org
If the TT work is now to focus on an authoring format rather than streaming then, based on our current use of XML for subtitle authoring in the BBC, I have several requirements to suggest.

The markup needs to be capable of carrying timing data and other information about the original text, down to individual-word level. Where the text is the scripted dialogue (or similar), these timings should refer to the audio track and hence should be in seconds (millisecond resolution is convenient).

For subtitle authoring we also need to identify who is speaking, because a text colour has to be assigned to each speaker. Just giving the text a colour doesn't work: the individual speakers need to be identified in the markup so that colour assignment can be automated (or changed easily) via the speaker names. (The same reasoning will apply in future to font details too.) This facility is essential for authoring subtitles: if you don't use text colours, or decide to change the colour assignment, many of the subtitles created from the initial text could change their grouping and layout. Put more generally, the text should be associated with a speaker, and a text style linked to that speaker.

It is also convenient to have scene changes identified, as these can influence the assignment of text colours. If subtitles are to be created, the timings of the shot changes in the video are also needed, as these affect the way the original text is grouped into subtitles.

Once the subtitles are created we have this additional data:

- the grouping of the original words in each subtitle
- in- and out-times
- subtitle positioning and display data
- foreground and background colours

As the subtitles are created, the original word timings (in seconds) are used to produce in- and out-times, which are in frames, as they relate to the video and are no longer locked rigidly to the audio track.
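The point about identifying speakers rather than hard-coding colours can be sketched as follows; this is a hypothetical illustration, not the BBC's actual tooling, and the speaker names and palette are mine.

```python
# Sketch: automated colour assignment keyed on speaker name.
# PALETTE and the speaker names are illustrative assumptions.
PALETTE = ["white", "yellow", "cyan", "green"]

def assign_colours(speakers):
    """Map each distinct speaker, in order of first appearance, to a colour."""
    colours = {}
    for name in speakers:
        if name not in colours:
            colours[name] = PALETTE[len(colours) % len(PALETTE)]
    return colours

# Because the markup carries speaker names rather than colours, swapping in a
# different PALETTE and re-running re-colours every subtitle consistently,
# without touching the text itself.
print(assign_colours(["Narrator", "Interviewee", "Narrator"]))
```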
In our case we use frame numbers for these timings, with the first I-frame being frame zero. When we subsequently produce the subtitle file for broadcast, we use the user's timecode of the first I-frame as an offset to convert our frame numbers into programme timecode.

As a short example of what I mean, here are a few lines I've taken from the XML file for the programme "Walking with Dinosaurs", which starts with the words "Imagine you could travel back in time, to a time long before man."

Firstly, our processing (using speech recognition) assigns a time to each word, with time=0 being the start of the audio file, i.e. no timecodes or frame numbers at this stage as this part has nothing to do with video. Our markup of this section is this (the tag 'Audio' means spoken content):

<Script>
  <Scene>
    <Title>Intro</Title>
    <Speaker>
      <Name>Narrator</Name>
      <Audio id="0" time="48.100" end="48.300">Imagine</Audio>
      <Audio id="1" time="48.310" end="48.420">you</Audio>
      <Audio id="2" time="48.480" end="48.600">could</Audio>
      <Audio id="3" time="48.850" end="49.200">travel</Audio>
      <Audio id="4" time="49.250" end="49.510">back</Audio>
      ...

Hence the first word is spoken 48.1 seconds from the start of the audio track and ends at 48.3 seconds. (In an earlier version of our markup we used durations rather than an end-time, but found we were forever converting to end times.)

Skipping ahead a little in the process, from this timed text (and shot changes, etc.) we create subtitles which are referenced to the video. Timings are now in frames. (Note that time=0 is not the same instant as frame=0, as the audio and video streams do not start together and we have to compensate for this offset.) The subtitle section in the XML file starts:

<Subtitles>
  <Sub in="1198" out="1255">
    <Line index="0">Imagine you could travel back in time</Line>
  </Sub>
  <Sub in="1256" out="1320">
    <Line index="0">to a time long before man</Line>
  </Sub>

The tag <Sub...
identifies each subtitle, and in and out are the frame numbers between which the text is displayed. The <Line... tag indicates the line number in the subtitle, with index="0" indicating this is the first (and, in this example, only) line of text in the subtitle. (I've omitted the colour and display-position details.)

Finally, elsewhere in the file is the line:

<Video file="\\serverName\Videos\walkingwithdinosaurs.mpg" firstframe="899620"/>

which tells us the video file this was authored with; the value of firstframe is the user's timecode for the first I-frame. So add 899620 to the first in-time of 1198 and we get an in-time of 900818 frames, which is 10:00:32:18 at 25 fps.

It's important to us to preserve the original ("audio-based") timings of the words, and these are not changed in any way as the subtitles are created from them. This means that we can produce different styles of subtitles (e.g. two-liners, three-liners, or Line 21 rather than UK teletext) by re-running only the subtitle formatting stage, as necessary.

Those are some of our key requirements for a TT markup for authoring. In practice we need quite a few more details too, but as this is getting a little too long, I'll leave those for another time.

Regards,

David.

--
David Kirby, Project Manager
BBC Research and Development
Kingswood Warren, Tadworth, Surrey, KT20 6NP, UK
Tel: +44 1737 839623    Fax: +44 1737 839665
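[Editor's note] The timecode arithmetic in the message (frame in-time plus the firstframe offset, rendered as HH:MM:SS:FF at 25 fps) can be sketched as follows; the function name is mine, not part of the BBC tooling, and non-drop-frame 25 fps timecode is assumed.

```python
FPS = 25  # UK PAL frame rate, as used in the message's example

def frames_to_timecode(total_frames, fps=FPS):
    """Render an absolute frame count as HH:MM:SS:FF (non-drop-frame)."""
    frames = total_frames % fps
    seconds = total_frames // fps
    return "%02d:%02d:%02d:%02d" % (
        seconds // 3600, (seconds % 3600) // 60, seconds % 60, frames)

# Worked example from the message: first I-frame timecode offset 899620,
# first subtitle in-time 1198 frames.
print(frames_to_timecode(899620 + 1198))  # -> 10:00:32:18
```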
Received on Monday, 24 February 2003 12:05:26 UTC