Re: [media] discussion on how to render captions in HTML5

On Fri, Aug 20, 2010 at 3:18 AM, Eric Carlson <eric.carlson@apple.com> wrote:

>
> On Aug 18, 2010, at 4:40 PM, Silvia Pfeiffer wrote:
>
> Hi media a11y folks, hi Eric,
>
> In today's call we had a brief discussion on the WHATWG specification of
> rendering of time-synchronized text with audio and video resources, see
> http://www.whatwg.org/specs/web-apps/current-work/complete/rendering.html#timed-tracks-0.
>
> The point I was making was that I am disappointed that for <audio> elements
> there is no rendering. I was suggesting that for both, <audio> and <video>
> elements, rendering of time-synchronized text should depend on the @controls
> attribute of the <audio> and <video> elements. The reason behind this is
> that I expect a menu to be made available to the user through the @controls
> that allows the user to activate/deactivate text tracks from the list of
> available text tracks. Because that list is made available through the
> @controls, I would also expect that the rendering of the text cues depends
> on this @controls attribute being available.
>
> However, the specification says that <audio> elements don't render any
> time-synchronized text, but only <video> elements do.
>
> We didn't get very far in the discussion - in particular Eric had some
> important points to make. Thus, I'd like to take up this discussion here
> again.
>
>   My point is that semantically, the *only* difference
> between <video> and <audio> elements is that the former renders visual media
> while the latter does not. There is absolutely no requirement that a file in
> a <video> element must have visual media, e.g. it is perfectly legal to use
> an mp3 file in a <video> element. For example, the following:
>
>     <video src="song.mp3" id="video" controls> </video>
>
> creates a 300x150 element that has only audio data so it doesn't draw
> anything (300x150 is the default size of a <video> element).
>
>   There is also no requirement that an <audio> element must not support
> files with visual media, it just doesn't render visual data. The
> following example puts an HD movie trailer in an <audio> element. The 'controls'
> attribute tells the UA to show the default controls, so it does take up
> space on the page but it does not render the visual track:
>
>     <audio src="HD_trailer_1024x768.mp4" id="audio" controls> </audio>
>
>  I believe that time-synchronized text, whether it comes from a track in the
> media file or is loaded from an external file, is *visual media* - it has a
> visual representation - so I don't believe it makes sense for an <audio>
> element to render it.
>
>   Silvia's proposal is to render text cues when an <audio> element has the
> 'controls' attribute. This might work for text-only cues, but if we allow an
> <audio> element to render time-synchronized text we of course have to allow
> it to render burned-in captions, sign language tracks, etc. Those are both
> video tracks, so her proposal is actually to make an <audio> element behave
> like a <video> element when it has a 'controls' attribute. This change would
> break the previous example because the video track would be rendered,
> requiring the page author to edit the movie to remove the video track.
>

I wasn't actually going to allow rendering of video in <audio> elements -
just rendering of text tracks.

It might be instructive to consider what types of tracks we expect to be
added onto what traditionally is regarded as an audio resource and a video
resource and whether we still regard audio resources with an in-band sign
language track as an audio resource or whether that has moved into the video
resource bag. I'd almost say the latter, even if the resource doesn't have a
"main" video track.


>   I believe that the person creating the web page should use a <video>
> element if a file has visual media, whether it comes from a video track or
> from time-synchronized text.
>


Let me ask a few questions - some of which I don't think have been
explicitly addressed yet.

Right now, the controls on audio and video are the same. In
http://www.whatwg.org/specs/web-apps/current-work/multipage/video.html#user-interface
it says that we will have a display for alternative accessibility tracks in
the controls: to "change the display of closed captions or embedded
sign-language tracks, select different audio tracks or turn on audio
descriptions" - which to me indicates introduction of a menu to
activate/deactivate tracks by the user.

Assume we have an audio file that has externally associated text tracks
and also in-band text tracks. Are we planning to have different controls
displayed for such a file when it is used in <audio> than when it is used in
<video>? I.e. will the available text tracks be exposed in a menu in the
@controls on an <audio> element? If so, what should happen when a user
activates a track? Unless the Web page author is using the JavaScript API to
render the activated tracks, the user will not see any reaction to their
activation of tracks. If instead we decide not to include the available
accessibility tracks in the @controls of an <audio> element, we've actually
just made the element inaccessible.
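
To make the "page author renders the tracks via JavaScript" case concrete,
here is a minimal sketch of what that script might look like. It assumes the
cues are already available to the page as plain objects with start, end and
text fields; that cue shape and both function names are hypothetical and not
taken from the spec. The only media element features it relies on are the
currentTime attribute and the timeupdate event.

```javascript
// Hypothetical cue shape: { start, end, text }, with times in seconds.
// Return the text of every cue active at time t
// (start inclusive, end exclusive).
function activeCueText(cues, t) {
  return cues
    .filter((cue) => cue.start <= t && t < cue.end)
    .map((cue) => cue.text);
}

// Wire a media element to a display element. Only `currentTime`,
// `addEventListener` and `textContent` are used; the rest is illustrative.
function attachCueRenderer(media, target, cues) {
  const render = () => {
    target.textContent = activeCueText(cues, media.currentTime).join("\n");
  };
  media.addEventListener("timeupdate", render);
  return render; // returned so the caller can trigger an initial render
}
```

A page could call attachCueRenderer(document.getElementById("audio"),
document.getElementById("lyrics"), cues), where "lyrics" is an assumed id of
some element next to the <audio> element. Without a script along these lines,
activating a track on an <audio> element has no visible effect.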

When we move beyond text tracks, the boundaries indeed become blurred.
Adding a video track with sign language to an audio file does raise the
question whether this is still an audio resource or has really just
turned into a video resource. I would say that your logic applies to this
case. But I am hesitant to accept that logic in the text track case. I
actually think that the current WHATWG spec doesn't yet address what to do
about rendering of audio and video accessibility tracks.

As for videos with burnt-in captions - we just have to regard the captions
there as part of the video pixels and thus if we use such a video in an
<audio> element, they will indeed disappear. But that looks like it also was
the intention of the Web page author. If you want to handle such captions
explicitly, you have to either do OCR and extract them or re-type the
captions into a caption file.

I understand where the logic is coming from in regarding <audio> as a
non-visual medium and <video> as the visual medium. I can accept that when
sign language tracks are there. I do wonder, however, what would be so bad
about allowing text tracks to be rendered with audio. The main use cases
I see here are foreign-language users, deaf and hearing people sharing the
experience, learning-impaired users, chapter navigation, and music
lyrics/karaoke users. None of these use cases asks for sign language, but all
of them are a basic accessibility need on what we traditionally regard as
audio resources.

Cheers,
Silvia.

Received on Thursday, 19 August 2010 22:58:33 UTC