RE: Requirements for external text alternatives for audio/video from Sean Hayes on 2010-04-03 (public-html-a11y@w3.org from April 2010)

From: Sean Hayes <Sean.Hayes@microsoft.com>
Date: Sat, 3 Apr 2010 11:06:17 +0000
To: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
CC: Laura Carlson <laura.lee.carlson@gmail.com>, Eric Carlson <eric.carlson@apple.com>, Geoff Freed <geoff_freed@wgbh.org>, "HTML Accessibility Task Force" <public-html-a11y@w3.org>, Matt May <mattmay@adobe.com>, Philippe Le Hegaret <plh@w3.org>
Message-ID: <8DEFC0D8B72E054E97DC307774FE4B911A4765FC@DB3EX14MBXC301.europe.corp.microsoft.c>
I'm not really drawing a line in the sand. My main concern here is that while these are all interesting ideas, and could indeed have accessibility benefits, they will require a lot of thought to get right, which will slow us down. The group is already under considerable pressure to get results to the WG, and I'd like to see us get the stuff we know we need to do done first, before we embark on any grand experiments.

You seem to have a mental model of how hyperlinks in captions would work: "If you click on such a link, the media resource would be paused together with its dependent tracks (including the captions). As you return to the media resource, it is unpaused and you continue to experience it." But that's not how hyperlinks work in HTML, where navigation is stateless, so it's not clear to me that this would be the navigation model. Why wouldn't they in fact navigate the host page? Or the video source, or to another point on the timeline of the video resource without pausing, or the caption source? If they navigate to a timed text resource and the video is paused, then what provides the time-base for the resource you navigate to? If they don't, what kind of resource is it in fact that they navigate to?

Whatever the model, it would require specification and trial implementations before we get it right. I'm not saying it couldn't be made to work, just that it's all new stuff that would take time to be specified and built out, and I don't want to hold basic captions to ransom till we figure it out, possibly at the expense of missing the HTML5 boat altogether.

The functionality you are talking about could already in fact be built into the hosting HTML webpage through script and the proposed media API, without having to involve captions, so it's not as though you won't be able to achieve these things independently.
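
A minimal sketch of how a hosting page's script might provide this, assuming a hypothetical `LinkCue` list kept by the page itself (it is not part of any caption format) and a `PlayerLike` interface that mirrors only the `currentTime`, `paused`, `pause()` and `play()` members of the proposed media API. The names `activeCues` and `followLink` are illustrative, not taken from any specification.

```typescript
// Hypothetical page-level data: a timed link the page author maintains.
interface LinkCue {
  start: number; // seconds, inclusive
  end: number;   // seconds, exclusive
  text: string;
  href: string;
}

// The small slice of a media element this sketch relies on.
interface PlayerLike {
  currentTime: number;
  readonly paused: boolean;
  pause(): void;
  play(): void;
}

// Return the links whose time range contains the given playback time.
function activeCues(cues: LinkCue[], time: number): LinkCue[] {
  return cues.filter((c) => time >= c.start && time < c.end);
}

// Pause the player, hand the link target to the page, and return a
// function that restores the playback position and state on return.
function followLink(
  player: PlayerLike,
  cue: LinkCue,
  open: (href: string) => void
): () => void {
  const resumeAt = player.currentTime;
  const wasPlaying = !player.paused;
  player.pause();
  open(cue.href);
  return () => {
    player.currentTime = resumeAt;
    if (wasPlaying) player.play();
  };
}
```

How `open` presents the target (a new window, an overlay, a panel beside the video) is left to the page, which is exactly the part of the navigation model that remains unspecified.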

TTML was designed to fit as a timed text resource into a wider Web context, such as SMIL or HTML+TIME, which are already endowed with such semantics. Not everything on the web has to be intrinsically interactive; PNG, for example, is not. You can make it interactive within a context, for example HTML image maps. This was the philosophy behind the decision to leave linking out of TTML. TTML is primarily designed to be slaved to an external clock source. When the audio and video are made interactive, for which SMIL is probably a better starting point than TTML, then TTML would fit into that world.


-----Original Message-----
From: Silvia Pfeiffer [mailto:silviapfeiffer1@gmail.com]
Sent: Saturday, April 03, 2010 10:13 AM
To: Sean Hayes
Cc: Laura Carlson; Eric Carlson; Geoff Freed; HTML Accessibility Task Force; Matt May; Philippe Le Hegaret
Subject: Re: Requirements for external text alternatives for audio/video

Hi Sean,

I have hesitated to reply to this email simply because you are drawing a line in the sand between what is accessibility for media and what isn't, and I don't quite follow it. The line between accessibility, usability and interactivity for Web media resources is IMO not as clear-cut as you make it out to be.

It is true: some media accessibility topics are clear and "well understood" as you say: captions, audio descriptions and transcripts.
These have originated from traditional audio-visual systems, in particular TV, where there are no hyperlinks. Does this mean that hyperlinks should be excluded from captions when we take captions to the Web? Does this mean that DAISY-style interactions should not be allowed for media resources on the Web, because we are modeling our view of media on the Web on traditional media?

When we took text documents from the desktop to the Web, the one thing that made a difference and that defined Web documents were hyperlinks.
We have even seen the reverse happen: Word documents which are traditional desktop text documents now include hyperlinks. I strongly believe that in future we will see hyperlinks in captions and subtitles on interactive TV. Why should we not start with such a "revolutionary" concept on the Web?

Further, I would be very disappointed if a "timed text markup language" is nothing but a "caption format". A "markup language" has traditionally stood for marking up hyperlinked resources. I honestly cannot explain to my fellow Web developers why the W3C would develop a TTML without hyperlinks. They go "but it's the Web!...?" and "but it's the W3C!...?" and I have no answers other than stating that DFXP hasn't really been developed for a Web context. But I believe we can fix this.

It is true - traditional captions and subtitles don't have hyperlinks.
Those can continue to be used in this case. But why not also introduce "modern" captions - captions that do have hyperlinking functionality.
Are you concerned that those people that use captions will get more functionality from the captions than people who do not turn on the captions? Are you concerned that there will be useful relationships represented in captions that people that do not use captions will not receive? I say: it's a good thing! For once, give the HoH an advantage over those that aren't. And let them decide what good quality captions are - they can always turn off captions whose hyperlinks are not acceptable, or choose an alternative without hyperlinks.

But let me address your objections:

* "audio is not interactive"

To this I would say: "not yet". As we introduce cue ranges or similar concepts, we will be able to introduce interactivity into audio and video resources. This introduction is also absolutely required: we are on the Web here and not on TV. Hyperlinking and interactivity are core to the Web. It is part of the whole "W3C Video on the Web" activity that we are part of, see http://www.w3.org/2007/08/video/ . Thus, with the introduction of interactivity to audio and video resources, there may well be a need to introduce a type of timed text track that provides alternatives for that interactivity - and we know from transcripts that hyperlinks are really important to link to text alternatives.


* "Adding interactivity to captions would break the semantic idea that they match the audio."

I do not see what the introduction of hyperlinks into captions has to do with breaking the "matching to the audio". Hyperlinks are URLs that are placed behind sections of text. That the text in this case is a caption, i.e. a segment of text that is a (mostly) literal transcript of the audio in the video (or audio), does not destroy the "matching to the audio". If you click on such a link, the media resource would be paused together with its dependent tracks (including the captions). As you return to the media resource, it is unpaused and you continue to experience it. There is no breakage.


* "Adding interactivity to captions <..> could end up being badly abused and confusing for the user."

This is not something that can be controlled - ever. Abuse has happened on the Web since its beginning - hyperlinks can always point to the wrong content, bad content or confusing content. It's not a reason to remove hyperlinks from the Web and it shouldn't be an argument to stop their introduction into timed text.


* "Adding interactivity to captions <..> could end up introducing unnecessary security and social engineering issues."

There is indeed a problem that we have to solve with security issues if we allow text from a third party server to be interpreted and displayed in a Web page from a different server. But it has nothing to do with introducing hyperlinks - there are no new security issues created through having hyperlinks in captions.


* "Captions should be as near as possible the exact equivalent of the audio, with adequate typography to be easily readable."

That can be achieved while also allowing hyperlinks. The two do not contradict each other.


Note that I would be open to introducing a different type of timed text track that is more interactive, along the lines of ATVEF as you outline it. I believe it does not require javascript injection, but that is certainly something to discuss.

The key points I wanted to make that strongly relate to this discussion though are:

* introducing hyperlinks in a timed text format is not difficult, but very powerful, and I do not see a reason why caption and subtitle files should be excluded from such functionality.

* whichever powerful timed text format we propose should allow for hyperlinks.

* this debate is important for the decision on how to implement captions & subtitles - if we choose an implementation that will not allow us to expose, e.g. hyperlinks, then it will restrict what we can do in the future when we want more powerful timed text. While right now with captions and subtitles - in particular in SRT - there is no need for anything fancy and we can just hide it all in a shadow DOM, this may prove to be a short-sighted decision in the future. I am just pointing out the bigger picture that we need to concern ourselves with.

Thus, I see direct and indirect relationships to accessibility issues in the hyperlinks discussion. It will not derail the whole concept of captions and subtitles if we don't let it. But it will help us make better decisions.

Best Regards,
Silvia.


On Tue, Mar 30, 2010 at 4:39 PM, Sean Hayes <Sean.Hayes@microsoft.com> wrote:
> I'm fully aware of what can be done with an interactive media system; I've worked on dozens of them over the last 20-odd years. What I'm saying is that you are trying to insert functionality here that the <video> and <audio> tags were not scoped for in HTML5, and doing so under the guise of accessibility seems to me somewhat contrived.
>
> There are a few well understood modes of accessibility for media which we need to address with high priority: captions, description, and transcript. Captions are a time-based text equivalent to audio; audio is not interactive, and neither should the captions be. Adding interactivity to captions would break the semantic idea that they match the audio, and could end up being badly abused and confusing for the user, as well as introducing unnecessary security and social engineering issues. Captions should be as near as possible the exact equivalent of the audio, with adequate typography to be easily readable. Captions also belong to the media, and so if any branding is to be supplied then it should be matched to the video content, not to the player, and it would be up to the content owner to supply the styling. Such branding should not be at the expense of readability. A similar argument would apply to subtitles. The caption text needs to be available to assistive technology, but that does imply that the HTML author needs to get involved to make that happen.
>
> Now if you want to introduce interactive media into HTML5, without invoking the full SMIL model, then you could certainly define another kind of timed track, perhaps along the lines of ATVEF, which creates javascript events, and carries a payload which could be injected into the HTML DOM; this is quite powerful enough to do all the things you list and more, and I'd be happy to contribute to a debate on the pros and cons of such a model vs SMIL. However that debate should not be part of an accessibility discussion, and if we have it here I think there is a very real danger of derailing the whole concept of caption and subtitle support in HTML5.
>
> -----Original Message-----
> From: Silvia Pfeiffer [mailto:silviapfeiffer1@gmail.com]
> Sent: Monday, March 29, 2010 12:19 AM
> To: Sean Hayes
> Cc: Laura Carlson; Eric Carlson; Geoff Freed; HTML Accessibility Task
> Force; Matt May; Philippe Le Hegaret
> Subject: Re: Requirements for external text alternatives for
> audio/video
>
> On Mon, Mar 29, 2010 at 7:38 AM, Sean Hayes <Sean.Hayes@microsoft.com> wrote:
>> I don't disagree with the need to provide appropriate alternatives to media, but the mechanism of providing a transcript is perhaps not best provided through the mechanism of trapping captions. As you say, captions would in fact probably not be an adequate replacement for the media without the text of description being included at minimum. Thus a transcript is more like the alt text on an image, a different semantic beast than captions, and probably better provided by other means.
>>
>> I think there is an important larger issue here. Is the text mechanism intended to provide captions and subtitles; or is the intention, as Silvia's examples would seem to suggest, to use it to turn HTML5 into a time-based medium like SMIL or HTML+TIME? If the latter, and this mechanism is intended to address corporate branding and advertising, then I think we are straying out of the remit of accessibility into something much larger which would need to be taken up in the wider group.
>
> The two examples that you are providing are two extremes:
> captions/subtitles on the one end, and SMIL/HTML+Time on the other.
> Right now and for the purposes of this group we are focused on captions/subtitles. But already with the features of DFXP there is a possibility to go a step further, without going all the way to the complexity of SMIL/HTML+Time - which, IMO, needs to come in at a different level.
>
> What I was describing is simply time-aligned text that is a bit more capable than just being plain text. In particular I am talking about hyperlinks, which are essentially nothing more than styled text, but provide Web functionality - something that should be very important to us in the given context. This has nothing to do with going all the way to SMIL/HTML+Time. It is still no more than captions or subtitles, but with the possibility of linking out at a given time.
>
> Think about it: we could have captions that allow us to explain things further - e.g. a movie about a historic event with names of people mentioned, where you could click through on the names of the people and find out what they were really like and why they are portrayed as they are in the movie. Directly related "supplementary material" - not banished to another resource as it currently is on DVDs. Actually available at your fingertips when you are interested in it.
>
> Or we could have captions of a political discussion with links to explain some background on the speakers.
>
> Or we could have captions that would link to a dictionary entry for words that are used very infrequently in a language.
>
> Or, of course, we could have links in ads to the eCommerce site of the current ad, so we can directly go and purchase the product.
>
> This is not difficult to do on top of what we have right now, but requires the ability to at least interact with links inside timed text.
>
> Note that I am not even sure if current DFXP/TTML supports hyperlinks, but if it doesn't I would be very keen on introducing them because they are extremely useful. Since DFXP/TTML is declared as being easily extensible, that should not be so hard to do.
>
> Regards,
> Silvia.
>
>
Received on Saturday, 3 April 2010 11:07:09 UTC