RE: [media] handling multitrack audio / video

I think this is an interesting area. In my previous work on TTML descriptions (which is where the ms:audio attributes come from), whether using pre-recorded or generated audio, I have generally found pausing the video to play the description to be the most effective approach.

For non-paused situations I think you generally have a fixed gap in the existing soundtrack that you need to hit. In that case, for generated speech, e.g. using SSML, one can constrain the duration using <prosody duration="Ns">, which might be more complex to do with CSS[1]. For recorded audio you can of course control the duration of each recorded clip to a certain degree, but if there are too many words for the gap you will have to deal with overlap; consequently, perhaps the most effective means of providing pre-recorded, non-pausing descriptions is to supply a complete alternate soundtrack with them pre-mixed in.
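
For example, a minimal SSML sketch (the description text and the 4-second target are invented for illustration):

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <prosody duration="4s">
    A man in a red coat enters the room and sits down.
  </prosody>
</speak>

Here the duration attribute asks the synthesizer to take roughly four seconds over the contained text, so the generated description stays within the available gap in the programme audio.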

Sean

[1] Applying audio content using CSS is an interesting idea, and it would apply to TTML in an HTML setting too if it worked; does any browser actually support that?
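
For illustration only, borrowing the description1 example from Masatomo's message below, one could imagine something like the aural 'cue-before' property (CSS 2.1 aural appendix / CSS3 Speech draft); this is just a sketch and I am not aware of any browser implementing it:

/* play the recorded clip before the paragraph whose xml:id is description1 */
#description1 { cue-before: url(description1.ogg); }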


From: public-html-a11y-request@w3.org [mailto:public-html-a11y-request@w3.org] On Behalf Of Masatomo Kobayashi
Sent: 01 December 2010 13:53
To: public-html-a11y@w3.org
Subject: RE: [media] handling multitrack audio / video

Could we discuss in detail the case where each description is stored discretely, particularly where the association between recorded and textual descriptions is needed?

The SMIL-like syntax seems to mean that timing information is embedded in HTML directly, in contrast to the case of timed text tracks where timing information is specified in external files (TTML, WebSRT, etc.).

This inconsistency would make it difficult to associate recorded descriptions with their corresponding textual descriptions. It might cause problems when recorded descriptions have alternative textual descriptions for Braille display; when descriptions are partly recorded (e.g., for special proper names) and partly synthesized; and when a "two-phase production strategy" is taken, where textual descriptions are provided temporarily at first and high-quality recorded descriptions are produced later for a better experience.

For this purpose, existing standards seem to be available. For example, the SSML 'audio' element might be used with TTML:

<p xml:id="description1" begin="15s" dur="5s">
  <ss:audio src="description1.ogg">
    This is the 1st description.
  </ss:audio>
</p>

The CSS 'content' property could be used with WebSRT:

1
00:00:15,000 --> 00:00:20,000
<1>This is the 1st description.

::cue-part(1) { content: url(description1.ogg); }

Sean's TTML demo seems to address this association issue using 'ms:audio' attributes, but I am not sure whether this topic has been discussed in the task force. As two different methods, recorded and synthesized, are available for video descriptions, the consistency between them could be considered. This would be an issue specific to descriptions, because only descriptions have two options; captions are always provided as text, and sign language is always provided as video.

Regards,
Masatomo



public-html-a11y-request@w3.org wrote on 2010/11/02 06:52:25:
>
> Comments below.
> Geoff/NCAM
> ________________________________________
> From: public-html-a11y-request@w3.org [public-html-a11y-request@w3.org] On Behalf Of Philip Jägenstedt [philipj@opera.com]
> Sent: Monday, November 01, 2010 4:52 AM
> To: public-html-a11y@w3.org
> Subject: Re: [media] handling multitrack audio / video
>
> On Mon, 01 Nov 2010 02:14:25 +0100, Silvia Pfeiffer
> <silviapfeiffer1@gmail.com> wrote:
>
> >> On Fri, Oct 29, 2010 at 2:08 AM, Philip Jägenstedt <philipj@opera.com>
> > wrote:
> >>> On Thu, 28 Oct 2010 14:46:32 +0200, Geoff Freed <geoff_freed@wgbh.org>
> >> wrote:
> >>
> >>> On Thu, 28 Oct 2010 13:05:57 +0200, Philip Jägenstedt
> >>> <philipj@opera.com>
> >>> wrote:
> >>>>
> >>>> It's
> >>>> beyond this most basic case I'd like to understand the actual use
> >>>> cases.
> >>>> To clarify, option 2 would allow things like this, borrowing SMIL
> >>>> syntax
> >>>> as seen in SVG:
> >>>>
> >>>> <video id="v" src="video.webm"></video>
> >>>> <video begin="v.begin+10s" src="video2.webm"></video>
> >>>> <!-- video and video2 should be synchronized with a 10s offset -->
> >>>>
> >>>> or
> >>>>
> >>>> <video id="v" src="video.webm"></video>
> >>>> <video begin="v.end" src="video2.webm"></video>
> >>>> <!-- video and video2 should play gapless back-to-back -->
> >>>>
> >>>> Are there compelling reasons to complicate things to this extent? The
> >>>> last example could be abused to achieve gapless playback between
> >>>> chunks in a
> >>>> HTTP live streaming setup, but I'm not a fan of the solution myself.
> >>>
> >>> I think there are compelling cases which are likely to occur in
> >>> production
> >>> environment because they are more efficient than the example I outlined
> >>> above.  For example, an author could store the same three descriptions
> >>> discretely, rather than in a single audio file, and then fire each one
> >>> at
> >>> the appropriate point in the timeline, in a manner similar to the one
> >>> you've
> >>> noted above:
> >>>
> >>> <video id="v" src="video.webm"></video>
> >>> <audio sync="v.begin+15s" src="description1.webm"></audio>
> >>> <audio sync="v.begin+30s" src="description2.webm"></audio>
> >>> <audio sync="v.begin+45s" src="description3.webm"></audio>
> >>
> >> Right, it's easy to see how it could be used. If the implementation
> >> cost is
> >> worth what you get, I expect that similar implementations already exist
> >> in
> >> desktop applications. Are there any implementations of such a system in
> >> widespread use and does it actually get the sync right down to the
> >> sample?
> >
> >
> > Jeroen from JWPlayer/Longtail Video has implemented something for
> > audio descriptions, where audio descriptions come in separate files
> > and are synchronized through markup - I believe the synchronization is
> > done in the JWplayer in Flash, see
> > http://www.longtailvideo.com/support/addons/audio-description/15136/audio-description-reference-guide
> > . AFAIK this is the most used platform for providing audio
> > descriptions on the Web at this point in time - I've seen it used on
> > government Websites around the globe.
> >
> > If it can be done in Flash in an acceptable quality, I would think
> > browsers should be able to do it. I can ask Jeroen for more
> > implementation details if necessary - AFAIK he said there was frequent
> > re-synchronization of the secondary resource to the main resource,
> > which continues playback at its existing speed.
>
> It sounds like perfect sync isn't being achieved and that scripting of
> some sort is being used to get approximately the right sync. That's
> already possible today with <audio> and <video>, as I'm sure you know. If
> that works well enough, that would really speak against requiring perfect
> sync if it's difficult to implement (and it is).
>
> ======
> GF
> I think that depends on what "approximate" means.  In most cases, a
> description playing a few frames earlier or later than the author
> intended probably won't be a deal-breaker.  Anything more than that
> might present a problem, depending on the context.  For example, if
> the author has written and timed a two-second description (a non-
> extended description) to play within a 2.1-second pause, the
> description sync being more than a few frames off might mean that
> the description collides with the program audio.
