
RE: video and long text descriptions / transcripts

From: John Foliot <john@foliot.ca>
Date: Thu, 5 Apr 2012 10:05:51 -0700
To: "'Silvia Pfeiffer'" <silviapfeiffer1@gmail.com>, "'David Singer'" <singer@apple.com>
Cc: "'HTML Accessibility Task Force'" <public-html-a11y@w3.org>
Message-ID: <006701cd134e$5c8ffed0$15affc70$@ca>
Silvia Pfeiffer wrote:

> On Thu, Apr 5, 2012 at 5:16 AM, David Singer <singer@apple.com> wrote:
> >
> > On Mar 30, 2012, at 14:52 , Silvia Pfeiffer wrote:
> >>
> >> We keep talking about "long text descriptions for videos" and
> >> "transcripts" as separate things. There is an implied assumption
> that
> >> we need two different solutions for these, which I would like to
> >> challenge.

Sorry I have not been able to participate more fully up until now, but with
a household move this past weekend, I am only now digging out.

Silvia, I would like to ask you what you believe the "longer textual
description" does for non-sighted users, and why authors should be providing
this information. You seem to be coming very strongly from a perspective of
"literalism", where you believe that the transcript is somehow the
equivalent of a long description. It isn't.

When I speak of a longer textual description, I differentiate it from an
Accessible Name (AccName) in the Accessibility APIs, which is the short
textual description (this is a movie; its name is "A Clockwork Orange"). We
don't have a native HTML5 means of applying an AccName to the video element
today, although as previously noted we can use either aria-label or
aria-labelledby.

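To make the gap concrete, here is a minimal sketch of supplying an AccName
via aria-label today (the file name and label text are invented for the
example):

```html
<!-- Short textual description (AccName) supplied via aria-label;
     src and label are hypothetical -->
<video src="clockwork-orange.webm" controls
       aria-label="Movie: A Clockwork Orange">
</video>
```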
When we look at a longer textual description, what I am looking for is
something that would map to the Accessible Description (or, to be even more
precise, the equivalent of the MSAA AccessibleDescription Property).  That
MSAA property is defined as:

	"An object's AccessibleDescription property provides a textual
description about an object's visual appearance. The description is
primarily used to provide greater context for low-vision or blind users, but
can also be used for context searching or other applications.

	The AccessibleDescription property is needed if the description is
not obvious, or if it is redundant based on the object's AccessibleName,
AccessibleRole, State, and Value properties. For example, a button with "OK"
would not need additional information, but a button that shows a picture of
a cactus would. The AccessibleName, and AccessibleRole (and perhaps Help)
properties for the cactus button would describe its purpose, but the
AccessibleDescription property would convey information that is less
tangible, such as "A button that shows a picture of a cactus.""
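To make the AccName/AccessibleDescription distinction concrete, a sketch of
how the two might be wired up with ARIA today (the ids, file name, and text
are all invented for illustration):

```html
<!-- AccName: a short label. Accessible Description: a brief note on
     visual appearance - NOT a transcript. All values hypothetical. -->
<video src="cactus-doc.webm" controls
       aria-label="Documentary: Desert Flora"
       aria-describedby="vid-desc">
</video>
<p id="vid-desc">
  A twelve-minute documentary filmed in the Sonoran desert, consisting
  mostly of close-up footage of flowering cacti.
</p>
```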

Clearly, and in truth, that is NOT a transcript, which you have defined
(correctly, IMHO) as:

> * a full transcription of everything happening in the video, including
> a transcript of all dialogs and the important visual bits

If we continue to work from the presumption that a Transcript is the
"caption file" minus the time-stamping aspect (are we in agreement here?),
then this also aligns closely to what a "movie caption" is, as defined by
the DCMP Captioning Key here:

	"Captioning is the process of converting the audio content of a
television broadcast, webcast, film, video, CD-ROM, DVD, live event, or
other productions into text and displaying the text on a screen or monitor.
Captions not only display words as the textual equivalent of spoken dialogue
or narration, but they also include speaker identification, sound effects,
and music description."
[source: http://www.dcmp.org/captioningkey]

In the case of a video that runs to 60, 90, or 120 minutes, that transcript
file could run to hundreds of [printed] pages and is most clearly *NOT* "...
a textual description about an object's visual appearance".

> But really: who is going to produce all these things?

Content providers who must, or who care to. This should not be one of our
concerns; we should only be ensuring that, if they do, they have a means of
applying "a textual description about" the video in question.

> And which one is
> the best for a deaf-blind user to have?

While I appreciate your consideration for this particular user-group, I
think you are casting your net over too narrow a group of users: any
non-sighted user would appreciate having a longer textual description of a
lengthy video without having to wade through a book's worth of text file
prior to watching (listening to) a video (complete with described audio).

> Certainly the answer is that a
> full transcription of everything being said and all the scene
> descriptions is the best that a deaf-blind user can have and also the
> most complete text representation of the video. I therefore call this
> "the optimal long description document". 

And I call it the "Transcript", which does not meet the definition of the
AccessibleDescription property as defined by the Accessibility APIs.

> The point I am trying to make is not at the detail level. It is at the
> macro level. How useful is it for a deaf-blind user to be presented
> with a number of documents that they could read that provide some form
> of "long description" for them? How useful is it to be several docs
> rather than a single one? Preferably it should just be one document
> and the one chosen should be the most inclusive one, the one that has
> the best description of them all, and that one is what we have this
> far discussed as "the transcript".
> Do you disagree?

Yes. Quite strongly in fact.

> > b) authors are unlikely to provide both, however
> Yes, that is one of the things on my mind, too. This is why I don't
> think it makes much sense to have both a @transcript and a @longdesc
> attribute on the video: if we have an actual transcript, it would be
> the same document behind both attributes and if we don't have on, we'd
> have a url behind the longdesc and none behind the transcript. In both
> these situations, the @transcript attribute is not useful.

With due respect, you are looking at this from the perspective of either the
implementer or the author, and not the end user. I cannot think of any
end-user who, when wanting to know which version of a video they are about
to consume, will first "read the book" - this is simply out of alignment
with reality.

We have (it seems to me) two problems here:

1) 'defining' what a longer textual description actually is, who it is for,
and the role it serves (a.k.a. the difference between what I am talking
about and "the transcript"), and

2) the programmatic means by which we link these various textual documents
to the <video> element. I proposed @transcript, but if a better solution
comes along, I am all ears and open to investigating it (and I note that
I've seen Ted's draft counter-proposal to Issue 194, but have not had time
to digest it yet).
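For reference, the @transcript idea amounts to markup along these lines
(the attribute is a proposal only, not part of HTML5, and the URLs are
invented):

```html
<!-- @transcript is a *proposed* attribute, not implemented anywhere -->
<video src="lecture.webm" controls
       transcript="lecture-transcript.html">
</video>
```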

> > c) the transcript/description should be part of the 'normal DOM'
> The text itself? That's how it could be provided, but why is that a
> requirement? Why is it not acceptable that the text is in another Web
> resource?

I agree with Silvia here - in fact given the length of some transcripts, it
would be massively intrusive if the "transcript" travelled with the video
asset every time: it should most definitely be an "on-demand" asset.
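In other words, something as simple as an ordinary link nearby satisfies
the "on-demand" requirement, with the transcript fetched only when the user
asks for it (file names invented):

```html
<video src="keynote.webm" controls></video>
<!-- Transcript lives in a separate resource, retrieved on demand -->
<a href="keynote-transcript.html">Read the full transcript</a>
```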

> > d) the relationship should be discoverable by anyone, not just
> accessibility tools
> I agree. It definitely has to be exposed by the video element to AT.
> In addition, there needs to be a visual presentation. There are
> several ways to get this: one is a visual indication on the video
> element which is provided through the shadow DOM, another is through a
> visual indication somewhere else in the browser (e.g. the URL is
> exposed on mouseover and a CTRL+ENTER click can activate it), and the
> last is that it's a separate DOM element on the screen that is
> programmatically linked.

...and/or it could be accessed via the contextual menu of the video player
itself. I agree with the requirement, but I think we can also be less strict
on how the asset is made discoverable (although I would also agree that
some kind of default be suggested/proposed).

> > e) they should use a common mechanism to link the media to its
> transcript/description etc.
> I disagree with this requirement.

Well... I continue to have an issue with conflating the two separate items.

I think that each item (transcript and longer textual description) should
have a dedicated means of programmatically linking it to its parent
<video>, but I am open to exploring ideas to achieve both. Continuing to
insist that both are the same thing, however, is a non-starter for me.

> A long description for the purposes
> of deaf-blind users has to be discoverable when focused upon the video
> element.

If the longer textual description were *only* for deaf-blind users, perhaps.
But that is not the role of the longer textual description, nor the only
target user-group.

> Other related content such as interactive transcripts,
> scripts, and other video metadata only has to live nearby the video
> and be discoverable when moving around the page. I don't see a need
> for a programmatic association of those with the video other than what
> @describedBy already offers.

Note that, even though aria-describedby can reference several ids, it only
ever yields a single flattened Accessible Description per element, so if
you are hoping to use it for both 'interactive transcripts' *AND* other
video metadata (and I've already expressed my concern over the use of that
specific term), then you will be out of luck - it's an either/or choice you
have. All the more reason to fully define and understand what all of the
different types of textual content we might have will be, and the role that
each of those different types (and files) serves for all users.
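A sketch of that limitation: even when aria-describedby lists several ids,
assistive technologies concatenate them into one description string, so the
two relationships cannot be distinguished (ids and text invented for
illustration):

```html
<!-- Both referenced elements get flattened into a single accessible
     description - there is no way to expose them as two distinct
     relationships through this one attribute. -->
<video src="talk.webm" controls
       aria-describedby="transcript-note metadata-note">
</video>
<p id="transcript-note">An interactive transcript is available below.</p>
<p id="metadata-note">Recorded at TPAC 2011; runtime 42 minutes.</p>
```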

Received on Thursday, 5 April 2012 17:06:25 UTC
