
Re: FW: associating video and audio transcripts…

From: Ingar Mæhlum Arntzen <ingar.arntzen@gmail.com>
Date: Mon, 18 May 2015 13:07:15 +0200
Message-ID: <CAOFBLLoKveK8rtZRkGsMPiky3vhYPtjqEyi9cqOpFHgUaf5n+w@mail.gmail.com>
To: public-html-media@w3.org
Cc: Daniel Davis <ddavis@w3.org>, chaals@yandex-team.ru, public-web-and-tv <public-web-and-tv@w3.org>, public-webtiming@w3.org
This is in reply to
https://lists.w3.org/Archives/Public/public-html-media/2015May/0027.html, as
feedback on the discussion around media transcripts:
http://www.w3.org/WAI/PF/HTML/wiki/Full_Transcript.

Thank you, Daniel Davis, for bringing this to the attention of the
Multi-device Timing Community Group https://www.w3.org/community/webtiming/
 via the monthly webtv report
https://lists.w3.org/Archives/Public/public-web-and-tv/2015May/0005.html.

The discussion about audio and video transcripts is relevant to the
Multi-device Timing Community Group, as it ties into what we think is a
larger discussion: how the Web supports composition of timed media sources
as well as time-sensitive UI components.

Currently, it seems the video element is supposed to grow into some kind of
hub for coordinated playback. The video element (with track elements) is
supposed to handle multiple media sources and recognize a series of new
media types such as subtitles, chapter information, transcripts or perhaps
advertisements, as exemplified by another recent discussion within the
Media Task Force
https://lists.w3.org/Archives/Public/public-web-and-tv/2015May/0001.html.
These new demands in turn bring a demand for standardization of new media
types, and possibly for built-in support for corresponding visuals in the
video element.

This, in my view, is a limiting approach. The Multi-device Timing Community
Group is advocating an alternative model, where the video element is no
longer the master time-source of a media presentation, but a slave like any
other time-sensitive component. Instead, we are proposing the
HTMLTimingObject https://www.w3.org/community/webtiming/htmltimingobject/
 as a new, explicit time source for timed media. This approach has a number
of significant advantages and crucially supports the flexibility and
extensibility that we have come to expect from the Web.


- Loose coupling: the only requirement for time-coordinated behavior is
  that components interface with (take direction from) a shared timing
  object.

- Better timing model (more precise and expressive).

- No dependencies between components: in particular, the video element may
  stay agnostic about the existence of transcripts (yet they may be
  navigated together and still give the appearance of being tightly
  connected).

- Keeps the video element as simple as possible (requires only improved
  support for timed playback).

- Easy to build presentations from multiple video clips (in parallel or
  in sequence).

- Timed data sources may be managed independently.

- Timed UI components may be defined and loaded independently.

- Custom visualizations are easy to build using the proposed
  HTMLSequencerObject (i.e. an improved track element):
  https://www.w3.org/community/webtiming/2015/05/12/proposal-htmlsequencerobject/

- Immediate support for multi-device playback through integration of the
  HTMLTimingObject with Shared Motion.

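The shared-timing model above can be sketched in a few lines of JavaScript. This is only an illustration of the idea (a clock described by a position/velocity/timestamp vector that any component can query and update), not the API proposed for the HTMLTimingObject; the class and method names are invented for the example.

```javascript
// Sketch of a shared timing object (illustrative, not the proposed API):
// the current position is extrapolated from the last known state vector,
// so any number of components can follow the same clock.
class TimingObject {
  constructor(now = () => Date.now() / 1000) {
    this.now = now; // injectable clock (seconds), handy for testing
    this.vector = { position: 0, velocity: 0, timestamp: this.now() };
  }

  // Current state, extrapolated from the last update.
  query() {
    const t = this.now();
    const { position, velocity, timestamp } = this.vector;
    return {
      position: position + velocity * (t - timestamp),
      velocity,
      timestamp: t,
    };
  }

  // Jump and/or change speed; every listening component follows.
  update({ position, velocity } = {}) {
    const q = this.query();
    this.vector = {
      position: position !== undefined ? position : q.position,
      velocity: velocity !== undefined ? velocity : q.velocity,
      timestamp: this.now(),
    };
  }
}

// A video wrapper could then "take direction" from the timing object,
// e.g. by periodically re-aligning currentTime:
//   const to = new TimingObject();
//   setInterval(() => { video.currentTime = to.query().position; }, 500);
//   to.update({ velocity: 1 }); // play everywhere at once
```

The key design point is that the video element is just one follower among many; a transcript viewer or a chess widget follows the same object without knowing the video exists.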

Further information about this model for timed composition is available
from the Multi-device Timing Community Group, as well as in this paper:
https://www.w3.org/community/webtiming/2015/05/08/linearcomposition/


Another (in my view distinct) issue in the transcript discussion is the
expression of semantic relations, e.g. allowing a crawler to deduce which
transcript belongs to which video. I would recommend that the scope of this
discussion be broadened as well. Transcripts are not the only thing: how
about timed geo-positions, or perhaps timed comments, or timed “likes”?
I have recently made an HTML-based chess board widget that is synchronized
with a chess video. This particular widget works on timed chess moves,
timed piece drags, timed square and border highlights, and timed canvas
drawings for analysis. The point is that different applications will define
different kinds of timed data.

In my view, the general problem with respect to semantic relations is how
to describe the structure of a media presentation that is composed from a
possibly large number of independent (and possibly custom) timed data
sources, as well as UI components. From this perspective, adding new tags
or attributes to the video element does not seem like a very attractive
solution, at least not in the long run. Perhaps something like a media
manifest could do this instead? Such a manifest could even be a document in
its own right, hosted on a web server and referenced by various components
from different web pages. An online media manifest would presumably
simplify work for crawlers and also emphasize the online nature of
multi-device linear media.
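Purely as an illustration, such a manifest might look something like the
following; every field name here is hypothetical, and the timing provider
URL is a placeholder.

```json
{
  "presentation": "Chess lecture, game 4",
  "timing": { "provider": "https://example.com/motions/42" },
  "sources": [
    { "type": "video/mp4",            "src": "lecture.mp4" },
    { "type": "text/transcript+html", "src": "transcript.html", "lang": "en" },
    { "type": "application/x-chess",  "src": "moves.json" }
  ]
}
```

A crawler could read such a document to learn that the transcript and the
chess data belong to the same presentation, without inspecting any page
that embeds them.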


In summary, I think improved Web support for temporal composition is a key
opportunity for Web media, and I hope the Media Task Force will use this
occasion to promote the importance of this topic.



Best regards,


Ingar Arntzen, Chair Multi-device Timing Community Group
Received on Monday, 18 May 2015 11:07:47 UTC
