- From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
- Date: Tue, 11 Dec 2012 20:24:06 +1100
- To: public-texttracks@w3.org
- Message-ID: <CAHp8n2kha04JqCtYrZSg01vsnM9h+nu5BSr1zVN2N=vxXbtTew@mail.gmail.com>
Hi all,

The topic of live video and captioning has come up a few times before in bugs [1] and more recently on the topic of regions [2]. I'd like to separate out two different topics here: "live broadcasting" and "live conversations".

While both have a need for captions, "live broadcasting" can tolerate delaying the video by a sufficient amount of time to introduce caption cues. Good captions typically have a duration of about 1-5 sec depending on their length and number of lines of text - about 2 sec per line [3]. Thus, a live broadcast could be delayed by about 3 sec to allow for the creation and synchronized delivery of a new line of cue text.

In contrast to "live broadcasting", such a delay cannot be tolerated in the "live conversations" use case, where people communicate with each other live. A real-time captioner would join the call as another participant and type a transcript of the conversation, not unlike the way a court stenographer types a transcript of court proceedings. In fact, this is very similar to how live captioning works right now on TV. It's also how captions work for Google Hangouts [4].

So, what is to be done about captions for live, real-time video?

One example implementation of the use of WebVTT for live streaming is Apple's HLS (HTTP Live Streaming) [5]. It breaks the timeline of the video down into small chunks and delivers a fully conformant WebVTT file for each such chunk of time, with maybe one or two cues in it (see the advanced stream example in [6]).

Since caption cues typically cover a duration of about 3 sec, a cue-based delivery of captions gives the deaf person the typing delay plus that 3 sec delay on top of it, which makes it almost impossible for them to interact with their peers in real time. Therefore, one requirement for captions in "live conversations" is that characters, or at least words, need to be transmitted to the peer(s) as soon as they have been typed. This is very similar to how real-time text [7] demos show the difference between transmitting full jabber messages and transmitting smaller packets with less text more frequently.

Supporting word-based transmission in WebVTT parlance would require appending a word to a cue every time a "space" character is typed, and starting a new cue every time the "enter" key is hit and a new line/cue is begun.

1. Solution in HTML

We can realize this by introducing an "append()" function for TextTrackCues in HTML. Then at least real-time caption functionality can be implemented in JavaScript. It could be something like:

  cue.append(text, duration, timestamp)

which would simply append the "text" to the cue's text content and add "duration" to the cue's end time. It could even add a timestamp in front of the text to capture when the word was provided.
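To make the proposed semantics concrete, here is a minimal JavaScript sketch of what such an append() might do, written as a standalone helper over a TextTrackCue-like object rather than as an actual method (no such method exists today); appendToCue() and formatTimestamp() are illustrative names of my own:

  // Sketch only: possible semantics for the proposed cue.append(),
  // expressed as a plain function over a TextTrackCue-like object.
  function appendToCue(cue, text, duration, timestamp) {
    // Optionally record when the word was typed, using a WebVTT
    // timestamp tag (<hh:mm:ss.mmm>) in front of the text.
    var prefix = (timestamp !== undefined)
        ? '<' + formatTimestamp(timestamp) + '>' : '';
    cue.text += (cue.text ? ' ' : '') + prefix + text;
    // Extend the cue's end time so the appended word stays on screen.
    cue.endTime += duration;
  }

  // Convert seconds to a WebVTT "hh:mm:ss.mmm" timestamp.
  function formatTimestamp(t) {
    function pad(n, w) { n = String(n); while (n.length < w) n = '0' + n; return n; }
    var ms = Math.round(t * 1000);
    return pad(Math.floor(ms / 3600000), 2) + ':' +
           pad(Math.floor(ms / 60000) % 60, 2) + ':' +
           pad(Math.floor(ms / 1000) % 60, 2) + '.' + pad(ms % 1000, 3);
  }

  // Usage: a word arrives 6.974 sec into the track and extends the
  // cue's display time by 2 sec.
  var cue = { text: "I'M", endTime: 8.873 };  // stand-in for a TextTrackCue
  appendToCue(cue, 'AT', 2, 6.974);
  // cue.text is now: I'M <00:00:06.974>AT   and cue.endTime is 10.873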
2. Solution in WebVTT

It would also be nice to be able to do this in WebVTT directly, e.g. using HLS. Using regions, I can imagine the following approach:

* every time a captioner hits the "space" bar, the word gets appended to a particular cue in a region, identified through "append:regionID#cueID" (see the keystroke sketch after the examples below)
* every time a captioner hits the "enter" key, a new cue is created and added to the region, identified through "region:regionID"
* the default duration of a cue is 2 sec, to give it sufficient time to stay on screen to be read
* we use regions to allow past cues to stay on screen longer and to automatically scroll up new cues

Example:

  cue1
  00:00:06.873 --> 00:00:08.873 region:reg1
  I'M

  00:00:06.974 --> 00:00:09.974 append:reg1#cue1
  AT

  00:00:07.030 --> 00:00:09.030 append:reg1#cue1
  THE

  00:00:07.104 --> 00:00:09.104 append:reg1#cue1
  LEFT

  ...

In this way it is possible to later post-process the files and merge the partial cues into one cue.

Example:

  cue1
  00:00:06.873 --> 00:00:09.104 region:reg1
  I'M AT THE LEFT

  ...

It could even keep the temporal details, using timestamps:

Example:

  cue1
  00:00:06.873 --> 00:00:09.104 region:reg1
  I'M <00:00:06.974>AT <00:00:07.030>THE <00:00:07.104>LEFT

  ...
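As a companion sketch, the following JavaScript shows how a captioner's keystrokes could be turned into cue blocks using the proposed (and at this point purely hypothetical) "region:" and "append:" settings; LiveCueEmitter and its serialization are my own assumptions, not anything defined by WebVTT:

  // Sketch: turn a captioner's keystrokes into cue blocks using the
  // proposed "region:" / "append:" cue settings described above.
  function LiveCueEmitter(regionId, send, defaultDuration) {
    this.regionId = regionId;
    this.send = send;                             // transmits one cue block
    this.defaultDuration = defaultDuration || 2;  // 2 sec default, as above
    this.cueCount = 0;
    this.currentCueId = null;                     // null => next word opens a cue
  }

  // "space" hit: transmit the word that was just typed.
  LiveCueEmitter.prototype.word = function (text, now) {
    // formatTimestamp() is the helper from the sketch in section 1.
    var timing = formatTimestamp(now) + ' --> ' +
                 formatTimestamp(now + this.defaultDuration);
    if (this.currentCueId === null) {
      // First word of a line: create a new identified cue in the region.
      this.currentCueId = 'cue' + (++this.cueCount);
      this.send(this.currentCueId + '\n' +
                timing + ' region:' + this.regionId + '\n' + text + '\n');
    } else {
      // Subsequent words: append to the current cue.
      this.send(timing + ' append:' + this.regionId + '#' +
                this.currentCueId + '\n' + text + '\n');
    }
  };

  // "enter" hit: close the current line; the next word starts a new cue.
  LiveCueEmitter.prototype.line = function () {
    this.currentCueId = null;
  };

  // Usage, feeding in the words of the example above (with a flat 2 sec
  // duration per cue, so end times differ slightly from the hand-written
  // example):
  var emitter = new LiveCueEmitter('reg1', function (block) { console.log(block); });
  emitter.word("I'M", 6.873);
  emitter.word('AT', 6.974);
  emitter.word('THE', 7.030);
  emitter.word('LEFT', 7.104);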
These are all very initial thoughts on how we could possibly do captions for "live conversations" using WebVTT. Feedback and counter-suggestions more than welcome!

Cheers,
Silvia.

[1] https://www.w3.org/Bugs/Public/show_bug.cgi?id=14104 / 18029
[2] http://lists.w3.org/Archives/Public/public-texttracks/2012Dec/0000.html
[3] http://joeclark.org/access/captioning/CBC/images/CBC-captioning-manual-EN.pdf
[4] https://plus.google.com/116514438961134149738/posts/6aKU4r7CNoY
[5] http://tools.ietf.org/html/draft-pantos-http-live-streaming-10
[6] https://developer.apple.com/resources/http-streaming/examples/
[7] http://www.marky.com/realjabber/real_time_text_demo.html

Received on Tuesday, 11 December 2012 09:30:18 UTC