- From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
- Date: Tue, 29 Mar 2011 17:37:20 -0700
- To: Sean Hayes <Sean.Hayes@microsoft.com>
- Cc: HTML Accessibility Task Force <public-html-a11y@w3.org>
Hi Sean,

On Tue, Mar 29, 2011 at 9:58 AM, Sean Hayes <Sean.Hayes@microsoft.com> wrote:

> It's possible that my issues aren't all tightly coupled to the idea of having track elements handled in the same way as video elements. Let's be clear then: my objection to your proposal is having tracks containing text be handled in a fundamentally different way to tracks containing video, and being constrained to the video rectangle. I don't need text tracks to be top level elements, nor indeed any specific markup/API solution, but it does seem to me that striving for a smaller set of components and having them share a common model where possible is a good thing when designing a new feature.
>
> "I'm sorry, but it seems to me that you might have mistaken a joke for agreement."
>
> 10 hours of discussion over 3 sessions and writing up a summary in the wiki-page is a very elaborate joke. I'm not sure what the point of it was, but it clearly worked, as it does indeed seem to have been a waste of my time being present at the f2f.

Sorry I was unclear about the situation. The two days of hard work on option 10 were indeed no joke at all, but serious design work, as was the design work on the eventual change proposal. The eventual change proposal is a compromise (as are many things in HTML). The only bit that I joked about was pulling all tracks out from underneath <video>, including <track>. I'm terribly sorry about that misunderstanding.

I understand your reasoning for supporting option 10 and considered it a valid alternative during much of the F2F, too. From a "clean design" point of view for the markup, it still makes a lot of sense to me. Going through the exercise of defining all the details of that proposal was a very good learning experience and showed me how much duplication it actually entails, in particular in the JavaScript API for audio and video. That is why we arrived at the compromise at the last minute.
I would appreciate your input into the compromise, since many of your arguments hold. There are indeed things we didn't fully spec out in proposal 2 given the short amount of remaining time.

> In order for me to understand your proposal perhaps you'd address the following:

Thanks. More than happy to.

> If it's difficult to put the text track in the viewport of a video when it's a separate element, how do you propose doing it for video?

Difficulty in rendering is about the default rendering, which should make immediate sense to the user and cover the 80% use case. Text should by default be rendered on top of the main video's viewport. Additional video tracks should by default be rendered next to the main video. This is achieved by proposal 2. Any other custom rendering has to be done through CSS, including picture-in-picture if that is required.

Your proposal 3 has the same default rendering for video as proposal 2, plus it has the additional problem that the cues are also rendered separately somewhere on the page, giving the user little default indication of the relationship. Option 10, in contrast, has the problem that everything is rendered into the same video viewport. Its proposed default rendering was to have each of the multiple video tracks rendered on top of the main video, thus randomly obstructing that video, which would not satisfy the 80% use case. So only proposal 2 provides the correct default rendering for both video and text.

> Can you describe in the "embedded in viewport" model how I spread captions across two videos placed side by side.

This can only be done using CSS. This is true for all existing proposals and therefore not an objection to any of them.

> The networkState states (or something very like them) are likely to be required if we ever intend to support live streamed captions, what's the plan for that?

As I said: I can appreciate that we might need the networkState for text tracks.
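For concreteness, the CSS-only approach to the "captions across two side-by-side videos" case above might look like the sketch below. The wrapper markup, class names, and the idea of painting cues into a styled div from script are assumptions for illustration, not part of any of the proposals.

```html
<!-- Two videos side by side inside a positioned wrapper, with a single
     caption area overlaying the pair. Script (not shown) would write the
     active cue text into the .captions div. -->
<div class="pair" style="position: relative; display: inline-block;">
  <video src="left.webm" width="320"></video>
  <video src="right.webm" width="320"></video>
  <!-- one caption area stretched across both viewports -->
  <div class="captions"
       style="position: absolute; left: 0; right: 0; bottom: 0;
              text-align: center; color: white;
              background: rgba(0, 0, 0, 0.5);"></div>
</div>
```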
That is a separate issue from the one we are discussing here, and a change that may be necessary for <track> anyway. I think we have a lot of feedback on <track> itself and need to address that properly. Let's keep it for another day.

> Separating out the videos does not necessarily make life easier. Not only do you have to explain away the redundant attributes, and continually repeat the timeline='...' attribute on slave elements, making it more verbose and error prone; you now have the opportunity for a whole bunch of coding errors that you wouldn't have to deal with in a nested model, for example:
>
> What is the behavior if video A references video B's timeline, and video B references video A's timeline? Who gets the controls?
>
> Is it legal for video A to slave to video B which slaves to video C? If not, what is the error behavior? If so, what is the behavior if there is a cycle?

This is a good point. I guess a slaved video cannot also be a master, so the intention here is probably transitive, and thus the master video should probably get controls for all of them. Also, a cyclic reference would make all of them slaves, so none gets the controls. The behaviour would probably be undefined, since it would be a markup error. But you are right: the use of the @timeline attribute makes the relationship definition prone to faulty markup. This applies to both proposal 2 and proposal 3.

> I strongly disagree that it is a good thing to have to make audio into a visual container in order to put captions into the page; it makes it less likely that authors are going to do the right thing. Moreover, since you still have to use CSS to make the null video have a sensible shape, why not apply CSS directly to the content you want to put in there.

This is again a different discussion to have, since it expresses a general disagreement with the way in which the current spec works for <audio>.
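To make the @timeline cases above concrete, a sketch of the slaving markup is given below. The attribute syntax was never finalized in the proposals, so the exact form shown here is an assumption for illustration only.

```html
<!-- Intended use: one master, two slaves sharing its timeline. -->
<video id="main" src="main.webm" controls></video>
<video src="signlanguage.webm" timeline="main"></video>
<audio src="description.ogg" timeline="main"></audio>

<!-- The error-prone cases Sean raises: a cycle (A slaves to B, B slaves
     to A), where neither element can be the master and neither should
     get the controls; chained slaving (C to B to A) raises the same
     transitivity question. -->
<video id="a" src="a.webm" timeline="b"></video>
<video id="b" src="b.webm" timeline="a"></video>
```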
<audio> is an element that has no visual presentation on screen, so there is no container into which you can render text, and therefore there is no default rendering for <track> elements. They are, however, allowed, and the cues are then exposed through JavaScript, so custom display still works. If you want to change the way in which the <audio> element works in general, that is another discussion to have. Let's keep it separate from the multitrack discussion.

> "Your proposal starts with the use of a text track that stands alone. What would be its visual representation? What is the use case?"
>
> It would be a block container, like div. Its use case is the ability to present timed markup anywhere in the layout, and in particular to provide one encompassing caption area over any number of videos (for example a page full of thumbnails).

When there is no video element on the page, do the cues continue to display on a timeline? Do they have default controls? What if there are captions that are only activated 5 min in - does the user have to wait 5 min for them to display?

And when there actually is a video element on the page - or several, assuming they are all slaved to a master - do you expect a default rendering across all of them? What if they are in different locations on the page? If you don't have such a default rendering, your proposal is in worse shape than proposal 2, which at least has a default rendering, and where you can use JavaScript and CSS to create the display that you are proposing.

> "So, are you saying that you still favor the #10 solution that we first discussed in San Diego?"
>
> As addressing the principles I'm concerned about, and as the starting point for continued discussion, yes. As a concrete solution, no, not necessarily.
>
> "Are you concerned about black bars and the like?"
>
> No. I'm concerned with a page that contains a set of videos, some of which may be too small (e.g.
> thumbnails) to effectively display captions in their viewport, and having a place to put those captions over the set as a whole.

That's an authoring issue. If your video viewport is too small, you should turn off automatic rendering and render your captions manually.

> "The main reason for moving away from it is that we realized that we were re-inventing for audio and video tracks exactly the same functionality that is already present for audio and video elements."
>
> Only you are still re-inventing, because now you have to add a whole bunch of special case code for top level video elements that aren't really top level elements to unhook their controls, handle their text tracks, remove the poster etc., and logic to deal with errors in hooking the elements together.

You are right, there is some special functionality: the controls on the master element get a menu to turn tracks on and off, which includes the tracks of the slave elements. Also, the timelines of all the elements are one and the same. Because all the timelines are the same, some of the IDL attributes related to playback of the slaves obviously can no longer mean the same as when the elements were standalone. These changes are necessary for any multitrack solution. However, everything else is still possible; in particular:

* It is possible to turn on controls on the slave videos and audios. Interacting with them is like interacting with the master.
* It is still possible to attach text tracks to each video individually, with individual presentation. You could, for example, have a translation of a sign language track displayed directly on top of the sign language video, while the main video carries the transcript of what is being said (thus helping people who are learning sign language).
* It is still possible to read the state of the individual elements, determine what they are up to, associate events with them, etc.
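The "turn off automatic rendering and render your captions manually" path can be sketched with the text track API: hiding a track suppresses its default rendering while still firing cue events, and script paints the cues wherever the author wants (this is also the script-only display path available for <audio>). Element IDs and file names here are made up, and the API details follow the eventual TextTrack interface rather than any one 2011 proposal draft.

```html
<video id="thumb" src="clip.webm" width="120">
  <track kind="captions" src="clip.en.vtt" srclang="en">
</video>
<!-- captions rendered outside the too-small viewport -->
<div id="caption-area"></div>
<script>
  var video = document.getElementById("thumb");
  var track = video.textTracks[0];
  track.mode = "hidden"; // load cues and fire events, but no default rendering
  track.oncuechange = function () {
    var cue = track.activeCues[0];
    document.getElementById("caption-area").textContent = cue ? cue.text : "";
  };
</script>
```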
> "it makes a lot of sense to have multiple tracks displayed next to each other rather than obstruct each other by trying to render into the same viewport. An author would be utterly confused if he defined multiple video tracks, but would only ever by default see a single video track."
>
> Right, but that's only a result of the insistence that the video creates a viewport. If instead it created the equivalent of an absolutely positioned containing box that behaves as a flow container that expands to accommodate its children, authors would get a different experience. Since all of your examples put the videos in a parent div anyway that does essentially that, it seems to me that's the most likely scenario anyway. Text tracks, in order to overlay the parent, can be defined as position:absolute with default origin and extent calculated to the video rendering area.

In our original approach to the problem with option 10, I suggested changing the meaning of the video viewport to an element that is filled with the video frames from all the tracks of the resource, arranged inside the viewport as neighbors ("tiling"). This introduces a whole new flow model for the viewport, in particular when we also want to position and display text tracks. I think it was this part of option 10 that made Eric and Frank cringe the most: they don't want to introduce a new layout engine for the video viewport when the CSS layout engine already provides what is needed for multiple videos.

I still maintain that the most typical layouts of multitrack video are: tiled, picture-in-picture, and a scrollable list. I would personally like to see these encoded in CSS, and thus have multiple videos laid out by just choosing one CSS value. However, I can see how that is an enormous burden on a browser, and am happy to use a different approach and expect authors to do the styling.

Cheers,
Silvia.
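The tiled and picture-in-picture layouts mentioned above can already be approximated with ordinary CSS rather than a new viewport layout model; a sketch follows, with made-up class names and file names.

```html
<style>
  /* tiled: videos flow next to each other inside the wrapper */
  .tiled video { display: inline-block; width: 320px; }

  /* picture-in-picture: the slave video sits in a corner of the master */
  .pip { position: relative; display: inline-block; }
  .pip .inset {
    position: absolute;
    right: 10px;
    bottom: 10px;
    width: 120px;
  }
</style>

<div class="tiled">
  <video src="main.webm"></video>
  <video src="signing.webm"></video>
</div>

<div class="pip">
  <video src="main.webm" width="640"></video>
  <video class="inset" src="signing.webm"></video>
</div>
```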
Received on Wednesday, 30 March 2011 00:38:13 UTC