Re: [media] alt technologies for paused video (and using ARIA) from Silvia Pfeiffer on 2011-05-15 (public-html-a11y@w3.org from May 2011)

From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Date: Sun, 15 May 2011 12:45:14 +1000
To: John Foliot <jfoliot@stanford.edu>
Cc: David Singer <singer@apple.com>, HTML Accessibility Task Force <public-html-a11y@w3.org>, Jared Smith <jared@webaim.org>, "E.J. Zufelt" <everett@zufelt.ca>
Message-ID: <BANLkTim8oPeE+kkUzef8KocUcpgZHazQww@mail.gmail.com>
On Sun, May 15, 2011 at 3:43 AM, John Foliot <jfoliot@stanford.edu> wrote:
> David Singer wrote:
>>
>> I'm going to try to clear up some of my own confusion here.
>>
>> I think we might need three pieces of information linked to a media
>> (video or audio) element:
>>
>> * a short text (kinda like alt)
>> * a long description
>> * a transcript
>>
>> in all cases, they should provide equivalents for what someone who can
>> consume the media 'normally' would pick up.  (I think this is as true
>> of audio as of video, by the way).
>
> Hi David,
>
> I agree, although the transcript is actually an asset that both "sighted"
> and "non-sighted" users will often have a desire for.  I am not too
> concerned about Silvia's proposal to introduce a new @transcript /
> @transcription attribute (outside of the fact that I am fussy about
> elements versus attributes, but that's for another argument, er,
> discussion).
>
>
>>
>> So, I was sort of right and sort-of wrong when I said that the short-
>> text should not describe the poster, but the media.  I'm right, the
>> element is more than the first frame or poster frame.  I'm wrong, in
>> that the (in this case sighted) normal user would have gathered
>> something from that initial frame.
>>
>> so, not good:
>>
>> <video poster="TheLeopard.jpg" short-text="A movie poster for The
>> Leopard" src="..." />
>>
>> because the sighted user will know it's a video element and that it's
>> offering them the trailer.
>>
>> Way better is to relay some of the information from the poster:
>>
>> <video poster="TheLeopard.jpg" short-text="Trailer for The Leopard,
>> starring Burt Lancaster" src="..." />
>
> *IF* the author does indeed choose to use a movie poster as a first-frame
> image choice. But despite its poor choice of name, the image referenced by
> @poster today could be *any* image, including a pure-play branding image
> ("iTunes Theater Presents: The Leopard", where the imagery would be
> partially stock or specially commissioned imagery including the iTunes
> "logo", the sell line as embedded display font, promotional movie stills,
> etc.) - in this case not only do we need a short textual description about
> the <video> - "Trailer for The Leopard, starring Burt Lancaster", we also
> need to provide the non-sighted user with the actual text burned into the
> image proper, and ideally a description of what that imagery is.
>
> In your example here, while the short-text value of "Trailer for The
> Leopard, starring Burt Lancaster" is indeed a short description of the
> video asset (the principal attribute of the <video> element, referenced by
> the src attribute), it conveys none of the information in the
> author-selected first-frame:
>
>        "An image of two film cans with Apples embossed upon them propped
> beside a film projector, and the text "iTunes Theater Presents: The
> Leopard""
>
> (see how you can actually visualize that?...)
>
> Earlier this week Leonie Watson summed it up quite clearly:
>
>        "When I arrive at a video (with my screen reader), I want to know
> what that static image/frame contains. At that moment in time, in the
> world according to me and my screen reader, that image exists entirely in
> its own right. It might be a still from the video, it might be a separate
> image. It might be related content, it might be a completely unrelated
> corporate ident (for example).
>
>        Wanting to know what that image contains doesn't prevent me from
> wanting to know what the video contains. There may well be overlap, but
> equally they could be worlds apart."
>
>
>>
>> the long description can provide a more narrative version of the
>> trailer, and the transcript a full transcript.
>
> At this time, the one thing that we all seem to be relatively in agreement
> on is that this particular requirement would most likely be handled by
> aria-describedby.
>
>        <video src="..." aria-describedby="synopsis"></video>
>        <p id="synopsis"> The Prince of Salina, a noble aristocrat of
> impeccable integrity, tries to preserve his family and class amid the
> tumultuous social upheavals of 1860's Sicily.</p>
>
>
> The assumption is that most videos will have *some* associated text
> describing something about the movie for sighted users on the same page,
> so that should be linked by aria-describedby (it is dangerous to make
> assumptions, true, however...). In the case where there is *no* on-screen
> description of the movie, then there would be no 'non-visual' description
> either - what's good for the goose is good for the gander, as my
> grandmother used to say.
>
>
>
>> This way the short text
>> is enabling the non-sighted user just like the sighted one:
>> sighted: see poster, decide it's interesting, watch trailer
>> non-sighted: get the short-text, decide it's interesting, read the long
>> description and/or transcript
>>
>> (I'm using non-sighted as a shorthand for someone who, for whatever
>> reason, can't see the video - their eyes are busy elsewhere, their UA
>> is unable to play it, and so on.  Hope that's OK).
>
> No Problem by me, we need to speak plainly sometimes, and I don't think
> you are being misunderstood here.
>
> ***************
>
> While on the topic of Plain Speaking:
>
> As a sighted user, I am always very careful when making assumptions and
> assertions on behalf of daily screen reader users. True enough that after
> over a decade of being an accessibility specialist I should have a pretty
> clear comprehension of the big picture, but still, I routinely discuss
> scenarios with a number of trusted blind users to be sure I have not
> strayed off track.
>
> Two such colleagues that I have previously discussed this subject with are
> Victor Tsaran (a daily screen reader user), who runs Yahoo!'s
> accessibility lab in Santa Clara and works with engineers and developers
> of all stripes as they produce web content for millions of daily consumers
> around the globe, and with Everett Zufelt, who has a CS Engineering
> background, is a daily screen reader user as well, and (as an engineer)
> has committed over 1,000 accessibility patches to the open source Drupal
> CMS system (http://groups.drupal.org/node/117539) - which is to say, he is
> smart, talented and gets it. Neither of them builds browsers or screen
> readers, but both have a first-hand perspective on delivering content to
> end users, both as content creators, as engineers supporting that content
> delivery, and as content consumers.
>
> After I talked with Everett on Friday, he wrote me an email, which he
> subsequently shared with this list yesterday
> (http://lists.w3.org/Archives/Public/public-html-a11y/2011May/0356.html).
>
> I think that the critical thing that Everett has identified, that some
> others continue to argue against, is that there are in fact 3 things here
> that need to be conveyed to the non-sighted user, irrespective of their
> source or how we code things up:
>
> * The video player (initiated by the <video> element), which is the
> bounding box (or canvas as Everett described it) and the controls
> associated to the player (whether they are the browser's native controls,
> or JavaScripted controls supplied by the author) - we need to ensure that
> all roles, states and properties are accessibly delivered
>
> * The 'video' itself - the media asset (which itself is a further
> composite of imagery, movement, sound, and text) - we need to ensure that
> each part of that composite asset has accessible alternatives
>
> * The 'still' image (regardless of its source or specific content).
>
> Since both the video and the still image are non-textual objects, and WCAG
> 2.0 Guideline 1.1 clearly states that *any* non-textual object requires
> textual equivalents, it is abundantly clear that we need mechanisms to
> provide that text for both the video and the still image.
>
>
> Silvia Pfeiffer wrote:
>>
>> The point that nobody seems to understand is that there is no need to
>> provide a text alternative for the video. All we need is a text
>> alternative for the poster (read: placeholder image). The video's
>> content is not presented at the time where a text alternative for the
>> video *element* is needed.
>
> Stating that Victor and Everett, both daily screen reader users and working
> engineers, "don't understand what they need" simply doesn't cut it - they
> clearly *do* understand: they understand engineering, they understand the
> web, they understand their AT tools and they understand their user
> experience. The video is the video, the still imagery is the still
> imagery, and both require a short textual alternative, as well as a longer
> textual description if and when appropriate.
>
> When Jared Smith, the Associate Director of WebAIM ("...and if anyone
> should know the best way it should be WebAIM." - Silvia Pfeiffer) writes
> to this list
> (http://lists.w3.org/Archives/Public/public-html-a11y/2011May/0322.html)
> and also confirms that the users and content authors that he interacts
> with daily have these requirements, or Leonie Watson, the Director of
> Accessibility at Nomensa, a leading UK-based web agency with clients such
> as P&G, Virgin, Nottingham University, the UK Treasury & UK Ministry of
> Justice (and more), writes to also confirm these needs in her first person
> voice
> (http://lists.w3.org/Archives/Public/public-html-a11y/2011May/0301.html),
> we must stop and ask, who is not really understanding?
>
> You do not solve use-cases and user requirements by insisting they don't
> exist. Arguments that we do not need both of these types of textual
> descriptions cannot be accepted - such arguments are (for me at least) a
> deal breaker - this is a hill I am prepared to die on. There are enough
> voices of blind users and accessibility specialists on this list alone who
> have made this statement of need that we simply cannot ignore their
> request, regardless of how clear or confused those initial statements of
> request were perhaps conveyed. If some contributors to this list cannot
> understand why these requirements exist, I am sorry that we have not been
> able to better explain why - it's not been for lack of trying. But it
> reaches a point where, if you still do not understand the "why", you need
> to trust those who are directly affected when they say they need
> something, and figure out a way to deliver it, even if you still don't
> fully understand why.


I am sorry, but you have taken this out of context. I was specifically
talking about the point in time where the video has been loaded, is
paused and only a representative image is visible on screen. It is
ONLY this situation that I was referring to, when I said that we do
not need to represent the content of the video at that time. And I was
referring to a situation where there is no other information about the
video available on the Web page. I still believe it would be wrong to
represent the content of the video in this situation (and in your
arguments above you seem to agree with that). I have advocated, though,
that we need to represent the content of the representative image. In
actual fact, I believe we are both arguing the same thing, except that
we propose different solutions and are mixing use cases with each other.


> It is my belief that until such time as we are in agreement on what *all*
> of our actual needs are, we will continue to be talking about incomplete,
> confusing or conflicting potential solutions. Before proposing aria-label
> or @title be shoe-horned in there somewhere for "alt technologies for
> video", let's be very clear what we are providing alternative texts for,
> and then we can look to effectively deliver those solutions.

I agree that we haven't clarified the different use cases / situations
yet. We need to clearly list them and then define which attributes /
elements provide the solution for each of these use cases. I was
trying to make a start on this in the wiki page, but I have obviously
also mixed use cases, so let's identify the different dimensions that
we have and then map the solutions.

As a start on listing the dimensions of interest that we should
discuss, I suggest the following:

* graphical browser / text-only browser: i.e. is the video element visible?
  This can be identified through the display of all HTML content
inside the <video> element.

* video player design (as per Everett's suggestion): i.e. is it the
native browser player or something custom?
  This can be identified through the absence of the @controls attribute.

* presence of a representative frame?
  This can be identified through the absence of the @autoplay attribute.

* presence of a short video description on the page?
  This would be represented through the presence of a video text title
on the page.

* presence of a video summary on the page?
  This would be represented through a description/summary of the video
content on the page.

* presence of a full-text transcript of the video content on the page
or on a different page?
  This would be represented through a (possibly interactive and
possibly timed) transcript on the page or on a different page.

* presence of a short poster summary on the page?
  This would be represented through the presence of a poster title on the page.

* presence of a long poster description on the page?
  This would be represented through a description/summary of the
poster content on the page.
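
As a rough illustration of the two attribute-based checks in the list
above (@controls and @autoplay), here is a minimal Python sketch using
the standard html.parser module. The class name, function name and
dictionary keys are made up for this example and are not part of any
proposal. It only inspects markup; the other dimensions depend on the
surrounding page content or on the browser and are not covered.

```python
from html.parser import HTMLParser


class VideoDimensions(HTMLParser):
    """Record the attribute-based dimensions for each <video> start tag."""

    def __init__(self):
        super().__init__()
        self.videos = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "video":
            return
        names = {name for name, _value in attrs}
        self.videos.append({
            # Assumption from the list above: no @controls suggests that
            # the page supplies its own (custom) player controls.
            "custom_player": "controls" not in names,
            # Assumption from the list above: no @autoplay means a
            # representative frame is shown until playback starts.
            "representative_frame": "autoplay" not in names,
        })


def video_dimensions(markup):
    """Return one dict of dimension flags per <video> element in markup."""
    parser = VideoDimensions()
    parser.feed(markup)
    parser.close()
    return parser.videos


if __name__ == "__main__":
    page = (
        '<video src="a.webm" poster="p.jpg" controls></video>'
        '<video src="b.webm" autoplay></video>'
    )
    for flags in video_dimensions(page):
        print(flags)
```

The flags are heuristics only: a page without @controls may simply have
no controls at all, so a real audit would still need to look at the
scripts and the text around the element.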

Have I missed any dimension?

Cheers,
Silvia.
Received on Sunday, 15 May 2011 02:46:02 UTC