
RE: Accessibility of <audio> and <video>

From: John Foliot - WATS.ca <foliot@wats.ca>
Date: Thu, 16 Oct 2008 13:23:16 -0700
To: "'Laura Carlson'" <laura.lee.carlson@gmail.com>, "'W3C WAI-XTECH'" <wai-xtech@w3.org>, "'html4all'" <list@html4all.org>, <public-html@w3.org>
Cc: <ian@hixie.ch>
Message-ID: <009101c92fcd$06272680$12757380$@ca>

Ian Hickson wrote:
> 
> 
> Fundamentally, I consider <video> and <audio> to be simply windows onto
> pre-existing content, much like <iframe>, but for media data instead of
> for "pages" or document data. Just as with <iframe>s, the principle I had
> in mind is that it should make sense for the user to take the content
> of the element and view it independent of its hosting page.

If all we are getting is another region, why bother?  What exactly would be the difference between <video> and <object id="video">, or even <div id="video">?  There is a *reason* for specifically naming that region a container for video, and as such we should be able to provide some native support for specific aspects of this form of multi-media delivery.
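For illustration, here is a minimal sketch of the distinction being argued for. The controls attribute on <video> is in the current draft; the file name and MIME type are hypothetical placeholders:

```html
<!-- A generic container: the UA knows nothing about what it holds -->
<object id="video" data="lecture.ogv" type="video/ogg"></object>

<!-- A dedicated element: the UA can supply native playback controls,
     and could, in principle, expose media-specific features (track
     selection, caption toggles) - the native support argued for above -->
<video src="lecture.ogv" controls>
  Fallback content for legacy user agents.
</video>
```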

> You should be
> able to save the remote file locally and open it in a media player and
> you should be able to write a new page with a different media player
> interface, without losing any key aspect of the media. In particular,
> any accessibility features must not be lost when doing this. For example,
> if the video has subtitles or PiP hand language signing, or multiple audio
> tracks, or a transcript, or lyrics, or metadata, _all_ of this data
> should survive even if the video file is saved locally without the embedding
> page.

In principle this is not wrong, but it introduces a level of bloat and complexity that should not exist.  "Burning" all modes of information transfer into one single unit makes it difficult to substitute or access specific modes of information - detangling, as it were - based upon user need.  It also flies in the face of current web content construction: we don't "burn" images into web pages, but rather reference them via URI.  So instead of using a tried-and-true convention, you are proposing a "new" means of delivering content via the web, simply because it now has the added ability to include an audio stream and motion simultaneously?

Consider this: burning a transcript/caption into a media file presumes that the user cannot hear the audio stream, but can read the onscreen text.  But how then could we re-purpose that transcript text for Braille output?  Extraction like that becomes difficult, if not impossible.  What if you also want to include the PiP Ian references, along with a caption file in the native language, plus perhaps translated sub-titling as well?  This introduces file bloat, as *all* of that digital data must travel down the pipe - the end user never gets to choose what they need or want - it becomes an all-or-nothing proposition - OLE for the web?

> 
> It turns out that this is actually not a huge problem -- video formats
> already have to deal with this.

This response suggests that you do not completely understand the problems noted above.  It is not a forward-thinking position, but rather one that suggests the status quo is good enough.  There have been too many debates to count on whether HTML 5 should be pushing the envelope or rather reflecting current practices (the now-aged "Pave the cow paths" discussion), but if we are to improve an already difficult situation we should, IMHO, be thinking about how to solve problems that might not be completely self-evident to the casual observer.  Delivering multi-media content is one such area, and I think the current proposal falls significantly short here.

> If you buy video from iTunes, all the
> metadata is within the file. If you buy audio from Amazon, all the
> metadata is within the file. If you transfer a video track from a DVD
> to a hard disk, one MPEG file can contain all the subtitles, video, and
> audio tracks.

There is a world of difference between metadata and full data - as the difference in names alone suggests.  Let's look at the above-noted examples:

1) video from iTunes:
Currently, if/when a captioned video asset is made available via iTunes, it is true that the caption file is burned into the .m4v file (using the bizarro, but apparently TV-industry-standard, .scc binary caption format).  This is a problem from a number of different perspectives: first, working with the .scc format is extremely complicated (it shouldn't be, but it is, and I know this from personal experience), so the entry barrier even for those who wish to "Do the Right Thing" is high (although this could and probably will change).  But the larger problem is that because of its direct 'burning' into the media, the text transcript becomes "hidden" from indexing engines; it cannot be output to alternative output devices (such as a refreshable Braille display), the text cannot be re-scaled (you should see how small captions are on an iPhone), and in effect you have a moving image file with text embedded into it, and no @alt value associated with it.  Accessible? You decide.

2) audio from Amazon - (all the metadata is within the file):
But is it?  Oh sure, you get the artist, title, song length and maybe even the "album cover", but do you get the full CD booklet, complete with liner notes, alternative photos, production notes, thank-yous from the band, etc.?  Back when I bought vinyl LPs, "albums" even came with posters and postcards (and I think I still have my Billion Dollar Bill from Alice Cooper's Billion Dollar Babies LP).  Nope, in the interest of smaller file sizes you get the bare-bones minimum.  However, when it comes to accessibility, the bare-bones minimum is often not enough, so we need to ensure that if/when you need more, you can get at it.  Keeping the additional pieces separate, but easily accessed on demand, is an efficient and scalable solution: hoping that everything can be burned into one asset is almost a guarantee that the additional stuff will not be included.

3) a video track [from] a DVD to a hard disk - (one MPEG file can contain all the subtitles, video, and audio tracks):
Really?  Have you actually "cracked open" a DVD and studied it, or made a copy of one?  As someone who has copied more than one movie from disc to disc (using a nifty little tool called DVD Shrink) I know with certainty that DVDs have numerous "regions", with each region containing a part of the total presentation.  To minimize compression of the movie (to fit on an off-the-shelf DVD) I routinely deselect Spanish subtitling (often English captioning too, as I have no specific need), and even remove "trailer" data that I really do not need or want.  Thus the DVD model is actually one of compartmentalization (including chapters - from the onscreen menu you can do "scene selection"), and *not* one single monolithic file.  The DVD model also allows you to toggle captions or sub-titling on and off, and to take advantage of numerous other "Menu" commands.  Why can't the newly minted <video> element look to feature this type of enhanced functionality?


> Furthermore, unlike images, videos tend to stand alone -- where
> an image can mean different things in different contexts, video files tend
> to mean the same thing in all contexts, so the accessibility alternatives
> aren't situational like image alternative text.

This is a mistaken view of video accessibility.  As with any other multi-modal delivery mechanism, access to the various modes might differ from person to person.  While there may be less ambiguity about the "meaning" of the asset, the means by which a particular user might access some or all of the supporting pieces that make the multi-media "multi" differs.  This is where accessibility needs to be addressed.

> 
> Thus, a fundamental principle of how this feature was designed is that
> any accessibility features and metadata features must be within the
> video or audio resource, and not in the HTML markup. The hypothesis is
> that this results in the optimal experience for all users.
> 

I think that this hypothesis is now open for serious questioning.
(And as an aside, I thought that HTML5 specs were determined by user testing and proof, and *not* by hypothesis...)

****************
Comments on comments:
>> The current draft, as I read it, does not support any type of 
>> @transcript or @alt type of attribute on either of these tags, making 
>> them both inaccessible to hearing impaired users, and video 
>> inaccessible to vision impaired users as well.

Ian Hickson wrote:
> 
> I hope that the above introduction now explains the reason for this.
> It's not that hearing impaired users (and many other users with special
> needs) are excluded, but that to best serve them we should address
> their needs in the media resources themselves.

This simply shifts the burden of unbundling onto the media player and/or the design of the media.  If HTML keeps images (be they .gif, .jpg, or .png) as separate yet linked assets in a "document", why would we use a different model for a "video document"?
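To make the parallel concrete, a hedged sketch of the image model applied to a "video document".  The rel="transcript" pattern is hypothetical - nothing like it exists in the current draft - it only illustrates the "separate yet linked" convention:

```html
<!-- Images today: separate, linked, individually retrievable assets -->
<img src="chart.png" alt="Quarterly sales chart">

<!-- The same model for video: the transcript is its own small,
     indexable, Braille-ready text resource, linked rather than burned in.
     All file names here are placeholders. -->
<video src="address.ogv" controls></video>
<a rel="transcript" href="address-transcript.html">Transcript of the address</a>
```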

Ian Hickson wrote:
> 
> This isn't the intent. The fallback is only intended for legacy UAs. I
> would expect all HTML5 UAs, including ATs, to support <video> and
> <audio>, exposing alternative tracks, subtitles, etc.

How?  How will the UAs detangle the various pieces required, based upon user demand?  How will the UA even notify the user of one or more less obvious features of a media asset: the presence of a sub-title file, a caption file, or an alternative track?  What part of the spec instructs us on how this is achieved?  Have any of the UA implementers that are part of the Working Group been able to provide a proof of concept?  Will the user be able to enlarge text without the need to enlarge an entire video screen if, for example, the onscreen action is perceived as only supporting the transcript, and not vice-versa?  (Consider: how many times have you listened to the TV while reading a book or magazine?  Which is "more important" then: the TV screen or the audio stream?)

> > > Actually, I think the idea is that the content stream itself is
> > > accessible.
> >
> > That is not something that the HTML specification can (or should)
> > mandate.

Ian Hickson wrote:
> Indeed, we are somewhat at the mercy of whatever codec and container
> formats we end up picking. We should definitely consider accessibility
> a high priority when selecting a good codec, though. For example, we
> shouldn't pick a codec that requires that subtitles be burnt in.

This completely contradicts what you said in your opening piece: "... one MPEG file can contain all the subtitles, video, and audio tracks."  This is quite confusing.

> >
> > So if the UA doesn't support the particular codec but does support
> > <video> it should not show the fallback content?
> 
> Correct.

So we *are* looking for a one-size-fits-all solution.  Thus any such solution needs to ensure that each mode of the multi-modal/multi-media asset can be exposed and used independently of any of the other modes.  That is what is going to ensure real accessibility.

> >
> > Scenario : the Principal addresses the University and his address is
> > recorded; his aide asks the webmaster to put the video up on the web.
> > The webmaster looks at the video and finds there are no
> > closed-captions, no subtitles, no accessibility features at all.
> > What is he to do ?  Refuse to put it up.  That would be a brave
> > webmaster indeed.  No, instead he puts it up, then relies on the
> > intelligent design of HTML 5 to allow him to add accessibility
> > features to overcome the deficiencies of the raw material. And it is
> > our responsibility to make sure that he can do this.

At the real university where I work, the video will go up ASAP, whilst the caption piece will follow along as soon as it is ready.  This might not be perfect, but it *is* pragmatic.  Being able to upload the transcript/caption file to the server 12 to 72 hours after the video file and have it "catch up" to the video "...already in progress" isn't my first choice, but it does achieve the big-picture end goal, without the need to do further post-production on the video asset.  The media player simply knows that the transcript is now available, and exposes the toggle button on the controls.  Intelligent, efficient, and graceful.
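As a sketch of that workflow (all file names and the rel value are hypothetical; no such caption-linking mechanism exists in the current draft):

```html
<!-- Day 1: the video goes up ASAP, uncaptioned -->
<video src="principal-address.ogv" controls></video>

<!-- Day 3: the caption file "catches up".  One line of markup is added,
     no post-production on the video asset is needed, and a player aware
     of the link can now expose its caption toggle -->
<video src="principal-address.ogv" controls></video>
<a rel="captions" href="principal-address-captions.html">Captions</a>
```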

> 
> The idea is that he would put those accessibility features in the video
> file itself, rather than just in the HTML. That way they don't get lost
> when the user moves the file around (e.g. to use their
> accessibility-optimised video player).

This is simply silly.  Today I can "save" a web page to my local drive, and the UA saves the HTML document, the CSS document and all of the images (pathing intact, no less).  Nothing gets "lost", so how/why would it be any different if the HTML document had referenced links to the various pieces that comprise the complete multi-media experience?  This is a hollow and unfounded assertion.


> >
> > OK, that's a constructive suggestion.  But it also seems to be an
> > attempt to hive off the responsibility for providing accessibility
> > features on others (in this case, the designers of SMIL).  Why don't
> > we just bite the bullet and make HTML 5 accessible, full stop ?
> 
> We should look at the Web platform as a whole as a single entity. HTML5
> is just one small part of that. Given that outlook, it doesn't really
> matter which committee designs the accessibility features, so long as
> they are there. What matters is what is the best technical solution for
> them. The best technical solution here is to have the accessibility
> features as closely associated with the moving pixels as possible, in
> the video file itself.

Or so you hypothesize.  Once again I place the burden of proof on you - I believe that I have refuted your major arguments already, but would welcome further data suggesting that the "media as monolith" scenario you propose is currently in practice and meets end-user accessibility requirements.

My experience with captioning .m4v files for delivery to mobile devices such as the iPhone suggests otherwise. (And I can provide examples as required: see http://phobos.apple.com/WebObjects/MZStore.woa/wa/viewPodcast?id=272900846 and download the last two episodes - they are captioned. Can you scale the text?  Can you extract the text and output it to a Braille output device? No.  By contrast, see: http://ecorner.stanford.edu/authorMaterialInfo.html?mid=1532 where the transcript [and translated transcript] can be accessed independent of the media. THIS is what we should be striving for, IMHO.)

> 
> One could equally ask the question "Why don't we just bite the bullet
> and make HTML 5 define the entire video format". Features like
> subtitles are no more or less important than features like the actual
> moving pixels of the dog on the skateboard -- they are all part and
> parcel of what the video is, and should be together.

In an OLE world, perhaps.  But the web is not OLE; each piece of the rendered onscreen output is a separate and inter-linked resource.  Images stand alone, style sheets stand alone, HTML stands alone, scripting stands alone..... (and each piece can be downloaded, saved and otherwise manipulated alone)

> 
> Getting back to video and audio, transcripts and lyrics are a good
> example of this. Instead of just providing transcripts to users who
> can't make full use of the video or audio, it is better to provide them
> to everyone. Many users who are otherwise quite capable of looking at
> video and audio may desire transcripts.

Correct.  A deaf person might want the lyrics to a song and not need the audio stream.  Yet the possibility may exist that the only place the lyric currently exists is 'burned' into the audio asset.  That is excessive overhead for zero gain - it would be far simpler to grab just the piece required - the text lyrics - and leave the rest of the media behind: a more bandwidth-friendly and efficient solution.  As an added bonus, if the lyric file (in POSH) included a <link rev=""> to the media itself, it would make the media that much more 'index-able' than if the lyric were burned into the song.
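A hedged sketch of such a stand-alone POSH lyric page (the rev value is unregistered and the URL is a placeholder, in the spirit of the rel=""/rev="" experimentation discussed elsewhere in this thread):

```html
<html>
  <head>
    <title>Lyrics: Billion Dollar Babies</title>
    <!-- Points back at the audio asset this page transcribes, so an
         indexing engine can associate the full lyric text with the song -->
    <link rev="transcript" href="http://example.com/media/billion-dollar-babies.mp3">
  </head>
  <body>
    <h1>Billion Dollar Babies</h1>
    <p>First verse of the lyric text goes here...</p>
  </body>
</html>
```

A deaf user (or a search engine) can fetch this small text page without ever pulling down the audio stream.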

> On Mon, 1 Sep 2008, Lachlan Hunt wrote:
> >
> > OK, let's look at the various kinds of alternatives that could
> > potentially be provided for people who either can't watch or don't
> > want to watch the video.
<SNIP>
> > Given that the alternative content is useful to so many people,
> > regardless of physical disability or technological limitations, it
> > makes sense for it to be provided in a way that makes it available to
> > everyone.  This is one reason why hiding alternative content away
> > within the video element is not helpful because it only makes it
> > readily available to a small subset of those people who might want or
> > need it.

Here, Lachlan and I are actually on much the same page. I would add to his thoughts by suggesting that each alternative should be easily accessible independent of the other supporting pieces, as required.  Think "right click" on a webpage image and "View Image".

Ian Hickson wrote: 
> I encourage people to register these rel="" values in the wiki and to
> try them. I am very interested in what experience with this teaches us.
> If it turns out to be a good idea, it's definitely something we could
> add to the spec.

+1.  I have earlier suggested how <link rev=""> used with captions/transcripts/lyrics might have huge payoffs for SEO, in that media assets could have much finer, more granular indexing, based not upon "keywords" (metadata) but upon actual content.


> > But there doesn't seem to be a desire for such use of videos -- they
> > all seem to be in the category of being 'important content' on the
> > page -- so, as Lachlan suggests, alternative representations could be
> > embedded in the video and still be appropriate.
> 
> Agreed.

Disagreed.

> On Wed, 3 Sep 2008, Dave Singer wrote:
> >
> > We've actually been thinking about the framework for accessibility of
> > media elements in HTML5.  Note that this is rather different from
> > discussing (say) caption formats or the like.  I've attached a
> > 'thought piece' on the subject, which attempts to lay out some of the
> > needs as we see them, and also proposes a way ahead.
> >
> > http://lists.w3.org/Archives/Public/public-html/2008Sep/att-0118/html5-media-accedssibility.html
> 
> In general I agree with this document, though, for the reasons
> described above, I do not agree with the conclusions regarding
> providing alt, longdesc, or other fallback inside the HTML file itself.

And for reasons I have elaborated on at length, I do not agree with *your* conclusions.  At the very least then, I do not believe this should be considered "closed".

Sincerely

JF
Received on Thursday, 16 October 2008 20:24:05 GMT
