- From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
- Date: Tue, 11 Dec 2012 20:24:06 +1100
- To: public-texttracks@w3.org
- Message-ID: <CAHp8n2kha04JqCtYrZSg01vsnM9h+nu5BSr1zVN2N=vxXbtTew@mail.gmail.com>
Hi all,

The topic of live video and captioning has come up a few times before in bugs [1] and more recently on the topic of regions [2]. I'd like to separate out two different topics here: "live broadcasting" and "live conversations".

While both have a need for captions, "live broadcasting" can tolerate delaying the video by a sufficient amount of time to introduce caption cues. Good captions typically have a duration of about 1-5 sec depending on their length and number of lines of text - about 2 sec per line [3]. Thus, a live broadcast could be delayed by about 3 sec to allow for the creation and synchronized delivery of a new line of cue text.

In contrast to "live broadcasting", such a delay cannot be tolerated in the "live conversations" use case, where people communicate with each other live. A real-time captioner would join the call as another participant and type a transcript of the conversation, not unlike the way a court stenographer types a transcript of court proceedings. In fact, this is very similar to how live captioning works right now on TV. It's also how captions work for Google Hangouts [4].

So, what is to be done about captions for live, real-time video?

One example implementation of the use of WebVTT for live streaming is Apple's HLS (HTTP Live Streaming) [5]. It breaks the timeline of the video down into small chunks and delivers a fully conformant WebVTT file for each such chunk of time, with maybe one or two cues in it (see the advanced stream example in [6]).

Since caption cues typically cover a duration of about 3 sec, a cue-based delivery of captions gives the deaf person the typing delay plus that 3 sec delay on top of it, which makes it almost impossible for them to interact with their peers in real time. Therefore, one requirement for captions in "live conversations" is that characters, or at least words, need to be transmitted to the peer(s) as soon as they have been typed. This is very similar to how real-time text [7] demos show the difference between transmitting full jabber messages and transmitting smaller packets with less text more frequently.

Supporting word-based transmission in WebVTT parlance would require appending a word to a cue every time a "space" character is typed, and starting a new cue every time the "enter" key is hit and a new line/cue is begun.

1. Solution in HTML

We can realize this by introducing an "append()" function for TextTrackCues in HTML. Then at least real-time caption functionality can be implemented in JavaScript. It could be something like:

  cue.append(text, duration, timestamp)

which would simply append the "text" to the cue's text content and add "duration" to the cue's end time. It could even add a timestamp in front of the text to capture when the word was provided.
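To make the proposed semantics concrete, here is a minimal JavaScript sketch of what such an append() might do, written as a standalone helper over a TextTrackCue-like object rather than as an actual method (no such method exists today); appendToCue() and formatTimestamp() are illustrative names of my own:

  // Sketch only: possible semantics for the proposed cue.append(),
  // expressed as a plain function over a TextTrackCue-like object.
  function appendToCue(cue, text, duration, timestamp) {
    // Optionally record when the word was typed, using a WebVTT
    // timestamp tag (<hh:mm:ss.mmm>) in front of the text.
    var prefix = (timestamp !== undefined)
        ? '<' + formatTimestamp(timestamp) + '>' : '';
    cue.text += (cue.text ? ' ' : '') + prefix + text;
    // Extend the cue's end time so the appended word stays on screen.
    cue.endTime += duration;
  }

  // Convert seconds to a WebVTT "hh:mm:ss.mmm" timestamp.
  function formatTimestamp(t) {
    function pad(n, w) { n = String(n); while (n.length < w) n = '0' + n; return n; }
    var ms = Math.round(t * 1000);
    return pad(Math.floor(ms / 3600000), 2) + ':' +
           pad(Math.floor(ms / 60000) % 60, 2) + ':' +
           pad(Math.floor(ms / 1000) % 60, 2) + '.' + pad(ms % 1000, 3);
  }

  // Usage: a word arrives 6.974 sec into the track and extends the
  // cue's display time by 2 sec.
  var cue = { text: "I'M", endTime: 8.873 };  // stand-in for a TextTrackCue
  appendToCue(cue, 'AT', 2, 6.974);
  // cue.text is now: I'M <00:00:06.974>AT   and cue.endTime is 10.873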
2. Solution in WebVTT

It would also be nice to be able to do this in WebVTT directly, e.g. using HLS. Using regions, I can imagine the following approach:

* every time a captioner hits the "space" bar, the word gets appended to a particular cue in a region, identified through "append:regionID#cueID" (see the keystroke sketch after the examples below)
* every time a captioner hits the "enter" key, a new cue is created and added to the region, identified through "region:regionID"
* the default duration of a cue is 2 sec, to give it sufficient time to stay on screen to be read
* we use regions to allow past cues to stay on screen longer and to automatically scroll up new cues

Example:

  cue1
  00:00:06.873 --> 00:00:08.873 region:reg1
  I'M

  00:00:06.974 --> 00:00:09.974 append:reg1#cue1
  AT

  00:00:07.030 --> 00:00:09.030 append:reg1#cue1
  THE

  00:00:07.104 --> 00:00:09.104 append:reg1#cue1
  LEFT

  ...

In this way it is possible to later post-process the files and merge the partial cues into one cue.

Example:

  cue1
  00:00:06.873 --> 00:00:09.104 region:reg1
  I'M AT THE LEFT

  ...

It could even keep the temporal details, using timestamps:

Example:

  cue1
  00:00:06.873 --> 00:00:09.104 region:reg1
  I'M <00:00:06.974>AT <00:00:07.030>THE <00:00:07.104>LEFT

  ...
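As a companion sketch, the following JavaScript shows how a captioner's keystrokes could be turned into cue blocks using the proposed (and at this point purely hypothetical) "region:" and "append:" settings; LiveCueEmitter and its serialization are my own assumptions, not anything defined by WebVTT:

  // Sketch: turn a captioner's keystrokes into cue blocks using the
  // proposed "region:" / "append:" cue settings described above.
  function LiveCueEmitter(regionId, send, defaultDuration) {
    this.regionId = regionId;
    this.send = send;                             // transmits one cue block
    this.defaultDuration = defaultDuration || 2;  // 2 sec default, as above
    this.cueCount = 0;
    this.currentCueId = null;                     // null => next word opens a cue
  }

  // "space" hit: transmit the word that was just typed.
  LiveCueEmitter.prototype.word = function (text, now) {
    // formatTimestamp() is the helper from the sketch in section 1.
    var timing = formatTimestamp(now) + ' --> ' +
                 formatTimestamp(now + this.defaultDuration);
    if (this.currentCueId === null) {
      // First word of a line: create a new identified cue in the region.
      this.currentCueId = 'cue' + (++this.cueCount);
      this.send(this.currentCueId + '\n' +
                timing + ' region:' + this.regionId + '\n' + text + '\n');
    } else {
      // Subsequent words: append to the current cue.
      this.send(timing + ' append:' + this.regionId + '#' +
                this.currentCueId + '\n' + text + '\n');
    }
  };

  // "enter" hit: close the current line; the next word starts a new cue.
  LiveCueEmitter.prototype.line = function () {
    this.currentCueId = null;
  };

  // Usage, feeding in the words of the example above (with a flat 2 sec
  // duration per cue, so end times differ slightly from the hand-written
  // example):
  var emitter = new LiveCueEmitter('reg1', function (block) { console.log(block); });
  emitter.word("I'M", 6.873);
  emitter.word('AT', 6.974);
  emitter.word('THE', 7.030);
  emitter.word('LEFT', 7.104);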
These are all very initial thoughts on how we could possibly do captions for "live conversations" using WebVTT. Feedback and counter-suggestions more than welcome!

Cheers,
Silvia.

[1] https://www.w3.org/Bugs/Public/show_bug.cgi?id=14104 / 18029
[2] http://lists.w3.org/Archives/Public/public-texttracks/2012Dec/0000.html
[3] http://joeclark.org/access/captioning/CBC/images/CBC-captioning-manual-EN.pdf
[4] https://plus.google.com/116514438961134149738/posts/6aKU4r7CNoY
[5] http://tools.ietf.org/html/draft-pantos-http-live-streaming-10
[6] https://developer.apple.com/resources/http-streaming/examples/
[7] http://www.marky.com/realjabber/real_time_text_demo.html

Received on Tuesday, 11 December 2012 09:30:18 UTC