
RE: [media] handling multitrack audio / video

From: Masatomo Kobayashi <MSTM@jp.ibm.com>
Date: Wed, 1 Dec 2010 22:53:07 +0900
To: public-html-a11y@w3.org
Message-ID: <OF1017202B.03A153BB-ON492577EC.004AEE3D-492577EC.004C467E@jp.ibm.com>
Could we discuss in detail the case where each description is stored 
discretely, particularly where the association between recorded and 
textual descriptions is needed?

The SMIL-like syntax seems to mean that timing information is embedded in 
HTML directly, in contrast to the case of timed text tracks where timing 
information is specified in external files (TTML, WebSRT, etc.).

This inconsistency would make it difficult to associate recorded 
descriptions with their corresponding textual descriptions. It might cause 
problems when recorded descriptions have alternative textual descriptions 
for a Braille display; when descriptions are partly recorded (e.g., for 
special proper names) and partly synthesized; and when the "two-phase 
production strategy" is taken, where textual descriptions are provided 
temporarily at first and high-quality recorded descriptions are produced 
later for a better experience.

For this purpose, existing standards seem to be available. For example, 
the SSML 'audio' element might be used with TTML:

<p xml:id="description1" begin="15s" dur="5s">
  <ss:audio src="description1.ogg">
    This is the 1st description.
  </ss:audio>
</p>

The CSS 'content' property could be used with WebSRT:

1
00:00:15,000 --> 00:00:20,000
<1>This is the 1st description.

::cue-part(1) { content: url(description1.ogg); }
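For illustration, the association could also be modelled as a cue that 
carries both forms, so a Braille display or TTS fallback uses the text 
while audio playback uses the recording. The following TypeScript sketch 
is only my assumption about how a user agent or script might represent 
this; the names are illustrative, not from any spec:

```typescript
// Hypothetical cue model: each description carries both its text and an
// optional recorded rendering of the same content.
interface DescriptionCue {
  start: number;     // seconds on the main media timeline
  end: number;
  text: string;      // textual description (Braille output, TTS fallback)
  audioSrc?: string; // recorded description, e.g. "description1.ogg"
}

// Choose a rendering: prefer the recording when one is associated,
// otherwise fall back to synthesizing the text.
function rendering(cue: DescriptionCue): { kind: "audio" | "tts"; value: string } {
  return cue.audioSrc !== undefined
    ? { kind: "audio", value: cue.audioSrc }
    : { kind: "tts", value: cue.text };
}

// The cue from the WebSRT example above, with its recorded counterpart.
const cue: DescriptionCue = {
  start: 15,
  end: 20,
  text: "This is the 1st description.",
  audioSrc: "description1.ogg",
};
```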

Sean's TTML demo appears to address this association issue using 'ms:audio' 
attributes, but I am not sure whether this topic has been discussed in the 
task force. Since two different methods, recorded and synthesized, are 
available for video descriptions, the consistency between them should be 
considered. This issue is specific to descriptions, because only 
descriptions have two options: captions are always provided as text, and 
sign language is always provided as video.

Regards,
Masatomo



public-html-a11y-request@w3.org wrote on 2010/11/02 06:52:25:
> 
> Comments below.
> Geoff/NCAM
> ________________________________________
> From: public-html-a11y-request@w3.org [public-html-a11y-
> request@w3.org] On Behalf Of Philip Jägenstedt [philipj@opera.com]
> Sent: Monday, November 01, 2010 4:52 AM
> To: public-html-a11y@w3.org
> Subject: Re: [media] handling multitrack audio / video
> 
> On Mon, 01 Nov 2010 02:14:25 +0100, Silvia Pfeiffer
> <silviapfeiffer1@gmail.com> wrote:
> 
> > On Fri, Oct 29, 2010 at 2:08 AM, Philip Jägenstedt <philipj@opera.com>
> > wrote:
> >> On Thu, 28 Oct 2010 14:46:32 +0200, Geoff Freed <geoff_freed@wgbh.org>
> >> wrote:
> >>
> >>> On Thu, 28 Oct 2010 13:05:57 +0200, Philip Jägenstedt
> >>> <philipj@opera.com>
> >>> wrote:
> >>>>
> >>>> It's
> >>>> beyond this most basic case I'd like to understand the actual use
> >>>> cases.
> >>>> To clarify, option 2 would allow things like this, borrowing SMIL
> >>>> syntax
> >>>> as seen in SVG:
> >>>>
> >>>> <video id="v" src="video.webm"></video>
> >>>> <video begin="v.begin+10s" src="video2.webm"></video>
> >>>> <!-- video and video2 should be synchronized with a 10s offset -->
> >>>>
> >>>> or
> >>>>
> >>>> <video id="v" src="video.webm"></video>
> >>>> <video begin="v.end" src="video2.webm"></video>
> >>>> <!-- video and video2 should play gapless back-to-back -->
> >>>>
> >>>> Are there compelling reasons to complicate things to this extent? The
> >>>> last example could be abused to achieve gapless playback between
> >>>> chunks in an HTTP live streaming setup, but I'm not a fan of the
> >>>> solution myself.
> >>>
> >>> I think there are compelling cases which are likely to occur in
> >>> production environments because they are more efficient than the
> >>> example I outlined above.  For example, an author could store the same
> >>> three descriptions discretely, rather than in a single audio file, and
> >>> then fire each one at the appropriate point in the timeline, in a
> >>> manner similar to the one you've noted above:
> >>>
> >>> <video id="v" src="video.webm"></video>
> >>> <audio sync="v.begin+15s" src="description1.webm"></audio>
> >>> <audio sync="v.begin+30s" src="description2.webm"></audio>
> >>> <audio sync="v.begin+45s" src="description3.webm"></audio>
> >>
> >> Right, it's easy to see how it could be used. If the implementation
> >> cost is worth what you get, I expect that similar implementations
> >> already exist in desktop applications. Are there any implementations
> >> of such a system in widespread use, and does it actually get the sync
> >> right down to the sample?
> >
> >
> > Jeroen from JWPlayer/Longtail Video has implemented something for
> > audio descriptions, where audio descriptions come in separate files
> > and are synchronized through markup - I believe the synchronization is
> > done in the JWPlayer in Flash, see
> > http://www.longtailvideo.com/support/addons/audio-description/15136/audio-description-reference-guide
> > . AFAIK this is the most used platform for providing audio
> > descriptions on the Web at this point in time - I've seen it used on
> > government websites around the globe.
> >
> > If it can be done in Flash in an acceptable quality, I would think
> > browsers should be able to do it. I can ask Jeroen for more
> > implementation details if necessary - AFAIK he said there was frequent
> > re-synchronization of the secondary resource to the main resource,
> > which continues playback at its existing speed.
> 
> It sounds like perfect sync isn't being achieved and that scripting of
> some sort is being used to get approximately the right sync. That's
> already possible today with <audio> and <video>, as I'm sure you know. If
> that works well enough, that would really speak against requiring perfect
> sync if it's difficult to implement (and it is).
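[The script-based approximate sync described above might look roughly like 
the following TypeScript sketch. This is an assumption modelled on the 
frequent re-synchronization behaviour Silvia describes, not JWPlayer's 
actual code; the drift check is a pure function, and the browser wiring is 
shown only in comments.]

```typescript
// Decide whether a separately-loaded description track has drifted too far
// from its expected position on the main timeline and needs a reseek.
function needsResync(
  mainTime: number,    // main video currentTime, in seconds
  descTime: number,    // description audio currentTime, in seconds
  startOffset: number, // where the description starts on the main timeline
  tolerance = 0.1      // allowed drift in seconds (~3 frames at 30 fps)
): boolean {
  const expected = mainTime - startOffset;
  return Math.abs(descTime - expected) > tolerance;
}

// Browser wiring (illustration only; requires DOM media elements):
// setInterval(() => {
//   if (needsResync(video.currentTime, desc.currentTime, 15)) {
//     desc.currentTime = video.currentTime - 15; // snap back into sync
//   }
// }, 250);
```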
> 
> ======
> GF
> I think that depends on what "approximate" means.  In most cases, a 
> description playing a few frames earlier or later than the author 
> intended probably won't be a deal-breaker.  Anything more than that 
> might present a problem, depending on the context.  For example, if 
> the author has written and timed a two-second description (a non-
> extended description) to play within a 2.1-second pause, the 
> description sync being more than a few frames off might mean that 
> the description collides with the program audio.
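[To put a number on "a few frames" (my arithmetic, assuming 30 fps): GF's 
2.0-second description timed to a 2.1-second pause leaves only about three 
frames of slack before it collides with the program audio.]

```typescript
// Slack between a pause in the program audio and a description timed to
// fit inside it, expressed in frames (the frame rate is an assumption).
function slackFrames(pauseSec: number, descriptionSec: number, fps = 30): number {
  return Math.round((pauseSec - descriptionSec) * fps);
}

// GF's example: a 2.0 s description inside a 2.1 s pause.
const slack = slackFrames(2.1, 2.0); // about 3 frames of margin at 30 fps
```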
Received on Wednesday, 1 December 2010 14:31:06 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Friday, 27 April 2012 04:42:26 GMT