RE: [media] alt technologies for paused video (and using ARIA) from John Foliot on 2011-05-14 (public-html-a11y@w3.org from May 2011)

From: John Foliot <jfoliot@stanford.edu>
Date: Sat, 14 May 2011 10:43:04 -0700 (PDT)
To: "'David Singer'" <singer@apple.com>, "'HTML Accessibility Task Force'" <public-html-a11y@w3.org>
Cc: "'Jared Smith'" <jared@webaim.org>, "'E.J. Zufelt'" <everett@zufelt.ca>
Message-ID: <006601cc125e$5eeeb5f0$1ccc21d0$@edu>
David Singer wrote:
> 
> I'm going to try to clear up some of my own confusion here.
> 
> I think we might need three pieces of information linked to a media
> (video or audio) element:
> 
> * a short text (kinda like alt)
> * a long description
> * a transcript
> 
> in all cases, they should provide equivalents for what someone who can
> consume the media 'normally' would pick up.  (I think this is as true
> of audio as of video, by the way).

Hi David,

I agree, although the transcript is actually an asset that both "sighted"
and "non-sighted" users will often have a desire for.  I am not too
concerned about Silvia's proposal to introduce a new @transcript /
@transcription attribute (outside of the fact that I am fussy about
elements versus attributes, but that's for another argument, er,
discussion).


> 
> So, I was sort of right and sort-of wrong when I said that the short-
> text should not describe the poster, but the media.  I'm right, the
> element is more than the first frame or poster frame.  I'm wrong, in
> that the (in this case sighted) normal user would have gathered
> something from that initial frame.
> 
> so, not good:
> 
> <video poster="TheLeopard.jpg" short-text="A movie poster for The
> Leopard" src="..." />
> 
> because the sighted user will know it's a video element and that it's
> offering them the trailer.
> 
> Way better is to relay some of the information from the poster:
> 
> <video poster="TheLeopard.jpg" short-text="Trailer for The Leopard,
> starring Burt Lancaster" src="..." />

*IF* the author does indeed choose to use a movie poster as a first-frame
image choice. But despite its poor choice of name, the image referenced by
@poster today could be *any* image, including a pure-play branding image
("iTunes Theater Presents: The Leopard", where the imagery would be
partially stock or specially commissioned imagery including the iTunes
"logo", the sell line as embedded display font, promotional movie stills,
etc.) - in this case not only do we need a short textual description about
the <video> - "Trailer for The Leopard, starring Burt Lancaster", we also
need to provide the non-sighted user with the actual text burned into the
image proper, and ideally a description of what that imagery is. 

In your example here, while the short-text value of "Trailer for The
Leopard, starring Burt Lancaster" is indeed a short description of the
video asset (the principal attribute of the <video> element, referenced by
the src attribute), it conveys none of the information in the
author-selected first-frame: 

	"An image of two film cans with Apples embossed upon them propped
beside a film projector, and the text 'iTunes Theater Presents: The
Leopard'"

(see how you can actually visualize that?...)

Earlier this week Leonie Watson summed it up quite clearly:

	"When I arrive at a video (with my screen reader), I want to know
what that static image/frame contains. At that moment in time, in the
world according to me and my screen reader, that image exists entirely in
its own right. It might be a still from the video, it might be a separate
image. It might be related content, it might be a completely unrelated
corporate ident (for example).

        Wanting to know what that image contains doesn't prevent me from
wanting to know what the video contains. There may well be overlap, but
equally they could be worlds apart."


> 
> the long description can provide a more narrative version of the
> trailer, and the transcript a full transcript. 

At this time, the one thing that we all seem to be relatively in agreement
on is that this particular requirement would most likely be handled by
aria-describedby. 

	<video src="..." aria-describedby="synopsis"></video>
	<p id="synopsis">The Prince of Salina, a noble aristocrat of
impeccable integrity, tries to preserve his family and class amid the
tumultuous social upheavals of 1860s Sicily.</p>


The assumption is that most videos will have *some* associated text
describing something about the movie for sighted users on the same page,
so that should be linked by aria-describedby (it is dangerous to make
assumptions, true, however...). In the case where there is *no* on-screen
description of the movie, then there would be no 'non-visual' description
either - what's good for the goose is good for the gander, as my
grandmother used to say.
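
As a rough sketch only (the id values and the wording of the poster text
are illustrative assumptions, not an agreed pattern): aria-describedby
takes a space-separated list of IDs, so a page that already shows both
kinds of text on-screen could point at both of them:

	<!-- hypothetical ids; both paragraphs are ordinary visible text -->
	<video src="..." poster="..."
	       aria-describedby="poster-text synopsis"></video>
	<p id="poster-text">Poster image: two film cans with Apples
	embossed upon them, propped beside a film projector, with the
	text "iTunes Theater Presents: The Leopard".</p>
	<p id="synopsis">The Prince of Salina, a noble aristocrat of
	impeccable integrity, tries to preserve his family and class amid
	the tumultuous social upheavals of 1860s Sicily.</p>

This covers only the longer descriptions. AT will typically run the two
referenced texts together as one description, and it says nothing about
how a short text for either the video or the still image is delivered.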



> This way the short text
> is enabling the non-sighted user just like the sighted one:
> sighted: see poster, decide it's interesting, watch trailer
> non-sighted: get the short-text, decide it's interesting, read the long
> description and/or transcript
> 
> (I'm using non-sighted as a shorthand for someone who, for whatever
> reason, can't see the video - their eyes are busy elsewhere, their UA
> is unable to play it, and so on.  Hope that's OK).

No problem by me, we need to speak plainly sometimes, and I don't think
you are being misunderstood here.

***************

While on the topic of Plain Speaking:

As a sighted user, I am always very careful when making assumptions and
assertions on behalf of daily screen reader users. True enough that after
over a decade of being an accessibility specialist I should have a pretty
clear comprehension of the big picture, but still, I routinely discuss
scenarios with a number of trusted blind users to be sure I have not
strayed off track. 

Two such colleagues with whom I have previously discussed this subject are
Victor Tsaran (a daily screen reader user), who runs Yahoo!'s
accessibility lab in Santa Clara and works with engineers and developers
of all stripes as they produce web content for millions of daily consumers
around the globe, and Everett Zufelt, who has a CS Engineering
background, is a daily screen reader user as well, and (as an engineer)
has committed over 1,000 accessibility patches to the open source Drupal
CMS (http://groups.drupal.org/node/117539) - which is to say, he is
smart, talented and gets it. Neither of them builds browsers or screen
readers, but both have a first-hand perspective on delivering content to
end users: as content creators, as engineers supporting that content
delivery, and as content consumers.

After we talked on Friday, Everett wrote me an email, which he
shared with this list yesterday
(http://lists.w3.org/Archives/Public/public-html-a11y/2011May/0356.html). 

I think that the critical thing that Everett has identified, that some
others continue to argue against, is that there are in fact 3 things here
that need to be conveyed to the non-sighted user, irrespective of their
source or how we code things up:

* The video player (initiated by the <video> element), which is the
bounding box (or canvas as Everett described it) and the controls
associated to the player (whether they are the browser's native controls,
or JavaScripted controls supplied by the author) - we need to ensure that
all roles, states and properties are accessibly delivered

* The 'video' itself - the media asset (which itself is a further
composite of imagery, movement, sound, and text) - we need to ensure that
each part of that composite asset has accessible alternatives

* The 'still' image (regardless of its source or specific content). 

Since both the video and the still image are non-text content, and WCAG
2.0 Guideline 1.1 clearly calls for text alternatives for *any* non-text
content, it is abundantly clear that we need mechanisms to provide
that text for both the video and the still image. 


Silvia Pfeiffer wrote:
> 
> The point that nobody seems to understand is that there is no need to
> provide a text alternative for the video. All we need is a text
> alternative for the poster (read: placeholder image). The video's
> content is not presented at the time where a text alternative for the
> video *element* is needed.

Stating that Victor and Everett, both daily screen reader users and working
engineers, "don't understand what they need" simply doesn't cut it - they
clearly *do* understand: they understand engineering, they understand the
web, they understand their AT tools and they understand their user
experience. The video is the video, the still imagery is the still
imagery, and both require a short textual alternative, as well as a longer
textual description if and when appropriate.

When Jared Smith, the Associate Director of WebAIM ("...and if anyone
should know the best way it should be WebAIM." - Silvia Pfeiffer) writes
to this list
(http://lists.w3.org/Archives/Public/public-html-a11y/2011May/0322.html)
and also confirms that the users and content authors that he interacts
with daily have these requirements, or Leonie Watson, the Director of
Accessibility at Nomensa, a leading UK-based web agency with clients such
as P&G, Virgin, Nottingham University, the UK Treasury & UK Ministry of
Justice (and more), writes to also confirm these needs in her first person
voice
(http://lists.w3.org/Archives/Public/public-html-a11y/2011May/0301.html),
we must stop and ask, who is not really understanding?

You do not solve use-cases and user requirements by insisting they don't
exist. Arguments that we do not need both of these types of textual
descriptions cannot be accepted - such arguments are (for me at least) a
deal breaker - this is a hill I am prepared to die on. There are enough
voices of blind users and accessibility specialists on this list alone who
have made this statement of need that we simply cannot ignore their
request, regardless of how clearly or confusedly those initial statements
of need were conveyed. If some contributors to this list cannot
understand why these requirements exist, I am sorry that we have not been
able to better explain why - it's not been for lack of trying. But it
reaches a point where, if you still do not understand the "why", you need
to trust those who are directly affected when they say they need
something, and figure out a way to deliver it, even if you still don't
fully understand why. 

It is my belief that until such time as we are in agreement on what *all*
of our actual needs are, we will continue to be talking about incomplete,
confusing or conflicting potential solutions. Before proposing that aria-label
or @title be shoe-horned in there somewhere for "alt technologies for
video", let's be very clear what we are providing alternative texts for,
and then we can look to effectively deliver those solutions.

JF
Received on Saturday, 14 May 2011 17:43:34 UTC