Re: Action-219: Draft Response to MSE on Bug 23661 from Aaron Colwell on 2013-12-17 (public-html-media@w3.org from December 2013)

From: Aaron Colwell <acolwell@google.com>
Date: Tue, 17 Dec 2013 09:21:36 -0800
To: Charles McCathie Nevile <chaals@yandex-team.ru>
Cc: "public-html-media@w3.org" <public-html-media@w3.org>
Message-ID: <CAA0c1bCAAv+yX6Ec94UStSbwaFALkH2UsaAJNqdA0_gGDjnh6g@mail.gmail.com>
Hi Charles,

Thanks again for taking the time to respond.

I think there might be some misunderstandings about what the section in
addSourceBuffer() is actually restricting. It is only saying that an
implementation has to support at least one MediaSource object and that each
object at a minimum must support 1 audio and 1 video track. It does not put
any restrictions on how many HTMLMediaElements are allowed to be used on
the page so you could still use mediagroup to satisfy your use case.
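
A minimal sketch of that reading, as a simplified model rather than spec text or a real MSE API: the `BufferTracks` type and the `withinMinimumSupport` function are hypothetical names introduced only for illustration. It treats each entry as the tracks carried by one SourceBuffer on a single MediaSource and checks whether the total stays within 1 audio and 1 video track.

```typescript
// Hypothetical model, not part of the MSE API: each entry describes the
// tracks carried by one SourceBuffer attached to a single MediaSource.
interface BufferTracks {
  audio: number;
  video: number;
}

// Returns true when the configuration stays within the assumed minimum:
// at most 1 audio and 1 video track per MediaSource, whether they share
// one SourceBuffer or are split across two.
function withinMinimumSupport(buffers: BufferTracks[]): boolean {
  const audio = buffers.reduce((n, b) => n + b.audio, 0);
  const video = buffers.reduce((n, b) => n + b.video, 0);
  return audio <= 1 && video <= 1;
}

// One SourceBuffer carrying both tracks.
const muxed = withinMinimumSupport([{ audio: 1, video: 1 }]);

// Two SourceBuffers, one track each.
const split = withinMinimumSupport([
  { audio: 1, video: 0 },
  { audio: 0, video: 1 },
]);

// A second video track (e.g. sign language) on the same MediaSource goes
// beyond the minimum, so support for it is up to the implementation.
const twoVideo = withinMinimumSupport([
  { audio: 1, video: 1 },
  { audio: 0, video: 1 },
]);
```

Note that the check is per MediaSource object. It says nothing about how many HTMLMediaElements a page uses, which is why a second element could still carry the extra video track.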

All of my comments around modifying HTML5 were made under the assumption
that you were suggesting the addition of a new requirement to the
HTMLMediaElement that would allow multiple selected video tracks in a
single HTMLMediaElement. Based on your responses it appears that I was
mistaken about what was being said. I'm sorry about that.

In general, I think MSE should be thought of as an alternate network layer
that the HTMLMediaElement functionality sits on top of. Just like existing
file parsers may not bubble up all tracks to the HTMLMediaElement, minimal
MSE implementations are only required to surface 1 audio and 1 video track.
This restriction does not affect higher level functionality, like
MediaControllers, that sit on top of HTMLMediaElement. Admittedly there is
some subtlety here, but I would argue that this already exists in HTML5
today w/o MSE.

Hopefully this clears things up. Does this change your position on the note
in any way?

Aaron



On Tue, Dec 17, 2013 at 2:17 AM, Charles McCathie Nevile <
chaals@yandex-team.ru> wrote:

> On Mon, 16 Dec 2013 22:22:56 +0100, Aaron Colwell <acolwell@google.com>
> wrote:
>
>  Comments inline..
>>
>
> Ditto.
>
> I apologise for the length of this mail, especially as it seems to me to
> be repeatedly answering the same few questions phrased slightly
> differently. But I felt it was important not to leave something apparently
> unanswered just because it didn't seem to introduce anything new to me.
> Which has meant I haven't had the time to shorten this for easier
> readability. I believe that there is a meeting today at which this topic
> may come up, so I wanted to provide the input in time for that.
>
>  On Mon, Dec 16, 2013 at 7:42 AM, Charles McCathie Nevile <chaals@yandex
>>
>>> On Thu, 12 Dec 2013 20:25:42 +0100, Aaron Colwell <acolwell@google.com>
>>>
>>
>  I too would like to find a good balance. I believe accessibility is
>> important, but I want to make sure that the text we add to specs actually
>> results in making things more accessible and not just be lip service.
>>
>
> Yes, I think we are on the same page with that.
>
>
>  I could just agree right now and blindly add in the text but it wouldn't
>> necessarily result in actual accessibility improvements in
>> implementations nor clarify things for implementers that want to "do the
>> right thing."
>>
>
> The Call for Consensus was framed about as open as I could figure it while
> keeping it clear what we were looking for, because I think that "blindly
> adding some text" would indeed not be the smart approach.
>
> On the other hand, I think the accessibility community is compromising a
> fair bit to walk away from requirements that support practical implemented
> approaches, and instead looking for something that provides a reasonable
> balance.
>
>   On Thu, Dec 12, 2013 at 10:28 AM, Paul Cotton <Paul.Cotton@microsoft.com
>>> >
>>>
>>>  See the extract from the A11Y TF IRC log below in which I made some of
>>>
>>>> the points in your response during the A11Y TF discussion this
>>>>> morning:
>>>>>
>>>>> http://www.w3.org/2013/12/12-html-a11y-irc
>>>>>
>>>>
>>>> Thanks. I appreciate this.
>>>>
>>>>  Does HTML5 have a similar note?
>>>>
>>>>>
>>>>> The TF plans to open a bug on HTML5 to cause this to happen.
>>>>>
>>>>
>>>> Ok. That seems like the proper path forward to me.
>>>>
>>>
>>> If you look at the log, you will further note that the reason for
>>> raising this against MSE first is that MSE is likely to ship well
>>> before HTML.
>>>
>>
>> So I feel like there are 2 parts to this.
>> First, if this type of accessibility is a true core value of the W3C it
>> seems like HTML should not be able to ship w/o this.
>>
>
> Indeed. I would expect very strong objection at the AC level if HTML
> simply ignored the use case. (After all, some people pay their membership
> fee specifically to work on accessibility in W3C - I can think of at least
> a dozen members where that is pretty close to their only reason for being
> in W3C...)
>
>
>  Based on the accessibility discussions I've observed during the ~2 years
>> I've been participating in the W3C, I know this is a contentious topic and
>> I don't really want to rile people up again.
>>
>
> No. And I certainly don't want to go back to the bad old days of
> intractable "he-said-she-said" discussions that look suspiciously like the
> participants aim to score debating points rather than improve things for
> users.
>
>
>  Second, sign-language video tracks support has not been specified
>> anywhere to my knowledge so it is unclear what requirements this
>> actually places on MSE.
>>
>
> As I understand the MSE spec, and Adrian's explanation of why 23661 "Works
> for Him", it already explains how to handle multiple sourceBuffers and
> doesn't constrain them *not* to be e.g. 2 or more video tracks.
>
> Section 2.4.5 refers to the HTML5 spec's concept of "selected Video" but
> that apparently doesn't conflict, in either MSE or HTML5, with the ability
> to use media controllers (as elaborated on below). In fact I couldn't find
> any practical impact of the selected video concept at all, beyond a DOM
> attribute that is only true for one video at a time.
>
>
>  I understand that politics and the desire to ship likely prevents this
>> from being added to the HTML5 train, but if anything this should be placed
>> in an extension spec so that other specs like MSE can evaluate
>> how to properly integrate with this functionality. It is not clear to me
>> that a simple note saying that more than 1 video track needs to be
>> supported to handle sign-language tracks is enough. At a minimum you'd
>> need to specify how multiple video tracks being selected at one time
>> should work since the current HTML5 text doesn't even allow it.
>>
>
> Actually, it is not only allowed but supported via the mediagroup attribute.
> [mg]
>
> At least one well-known source of information about HTML specifications
> [w3school] describes it claiming support in 5 browsers, and I tested it in
> mine (which is not one of the five) where it worked fine.
>
>
>  That sort of information is required to properly update the algorithms
>> in the MSE spec to support this use case.
>>
>
> If that were true, Adrian's "Works for Me" claim on the bug should be
> changed until we figure out if it does work. But it seems you're already
> doing far better than you claim, and that Adrian's claim is justifiable as
> a technical response.
>
>
>  There are likely many other details that would need to be ironed out
>> before it is clear how to properly enable support for this in MSE.
>>
>
> Having looked carefully, I cannot find any details in the algorithms that
> would need to be changed. Which accords with my understanding that this can
> already be supported in practice. Of course I am open to explanation of
> what specific things would not work in MSE, and looking at how to deal with
> such issues.
>
>
>  I object to adding this note to the MSE spec. This is an attempt
>>>>>> to give weight to an accessibility issue that should be solved by
>>>>>> the spec that defines HTMLMediaElement behavior (ie HTML5 &
>>>>>> HTML.next) and not an extension spec that is simply providing an
>>>>>> alternate way to supply media to the HTMLMediaElement.
>>>>>>
>>>>>
>  Enabling a superior experience for users is a laudable goal. Indeed, it
>>> is also at the core of accessibility as understood at W3C.
>>>
>>> A general part of W3C's claims about its technology is that they work
>>> for all people, regardless of disability - which in this case I believe
>>> one can reasonably interpret as "…including those who require signed
>>> captioning and other advanced potential sourceBuffers to be delivered
>>> to the HTMLMediaElement".
>>>
>>
>> In my opinion, this is too strong of a claim for the W3C to make
>> credibly. It completely ignores the constraints of actual implementations.
>>
>
> Does it? Your Google colleague outlined a strategy YouTube apparently
> proposes to ship and claimed that the code supported the objection. I
> outlined another strategy, implemented in running code in a real
> University. Adrian already claimed that this can be done, as did James
> Craig within the HTML Accessibility Task Force.
>
>
>  I believe it is a great goal and we definitely should work towards
>> enabling access in any way that we can. I believe this is best done by
>> first attacking this problem at the HTMLMediaElement level. This could be
>> inside an HTML spec or a new extension spec. Either way, we need to
>> define how the element deals with these new use cases before we can
>> determine how MSE needs to be changed.
>>
>
> Do you disagree that media controllers allow for this use case in the
> current HTML specification? (The current specification may not be ideal,
> but it appears to me that James and Adrian and W3Schools are correct and
> this can already be done).
>
>
>  I'm happy to update MSE when this behavior is defined, but until
>> then, I don't really think that such a note provides much value to
>> implementers or guidance for content authors.
>>
>
> Until when? Was Adrian wrong to close the bug WfM, are others wrong in
> pointing out that HTML supports multiple videos through media controllers?
>
>
>  Without arguing for the TF’s request I do want to point out they are only
>>> asking for the addition of a non-normative Note.
>>>
>>>>
>>>> I understand, but I don't think we need to add an informative note in
>>>> MSE indicating how multiple tracks would be useful. In my opinion this
>>>> is a quality of implementation issue and if implementations want to
>>>> make MSE content accessible then they will support more than the
>>>> minimal requirements.
>>>>
>>>
>>> It is normal for W3C specifications to support maximal accessibility
>>> "out of the box", since access for all is one of W3C's core values. A
>>> specification which did not do so and required special unexplained extra
>>> implementation to support basic accessibility use cases would be
>>> reasonably likely to attract formal objections.
>>>
>>
>> HTML5 and/or HTML.next does not appear to support this "maximal
>> accessibility" right now. Are there formal objections for this?
>>
>
> No, and as explained I believe they are unnecessary because I believe your
> colleagues (rather than you) are correct and HTML *does* support this use
> case right now. If that really isn't the case I certainly expect formal
> objections to the HTML5 specification in due course.
>
> (In the more general sense, yes there are outstanding objections against
> HTML5 where it does currently fail to support accessibility, but given that
> it is at a different stage of advancement that isn't actually relevant
> right now).
>
> I have also seen the use case of multiple synchronised video tracks
> running in multiple different HTML systems, and while I realise that "a
> demo can work" isn't the same as "it really works" I am quite surprised to
> hear you contradicting Adrian's statement in particular.
>
> If it is really true that this doesn't work, I think we should learn why,
> and importantly how others who are actually editing the same spec come
> to different conclusions.
>
>
>  It seems to me that supporting sign-language tracks is also an
>> "unexplained extra implementation" that doesn't appear to be defined
>> anywhere. The note does not appear to actually improve the situation.
>>
>
> I am working on the assumption that HTML5 does support multiple
> synchronised tracks.
>
>
>  The proposed resolution of the Task Force assumes there will be shipping
>>> implementations incapable of supporting these use cases, but
>>> nevertheless useful in more restricted environments. It also assumes
>>> that there are people who expect to support accessible use cases by
>>> default, and indeed to look for solutions which do so as a matter of
>>> preference. It recognises this as a quality of implementation issue.
>>> It does not assume that the *only* way to provide high-quality
>>> accessibility is through the use of MSE. It merely requests that the
>>> specification acknowledge that a minimally conforming implementation
>>> may not satisfy certain use cases.
>>>
>>>
>> If I add a note along the lines of "The minimal requirement of 1 video
>> and 1 audio track may not be sufficient to support accessibility use
>> cases like sign-language or audio description tracks.", how does this
>> help? It may cause people to think that these use cases could not be
>> supported with MSE on these restricted implementations, which is not
>> true.
>>
>
> Agreed, although that would force particular implementation strategies that
> I think it is unreasonable to assume are appropriate for all uses,
> particularly given the concrete evidence that implementors felt it
> necessary to support use cases that would be incredibly difficult to do
> with that approach.
>
>
>  You could still use MSE to display sign-language or alternate audio
>> even if only one track of each type is allowed. It seems like the
>> "may" here leaves too much open to interpretation and this note could
>> end up simply being a lie and prematurely scare people off.
>>
>
> Agreed. Getting the text correct is important, and I am happy to work on
> ensuring we do so, if you have accepted the request for some
> acknowledgement that the minimal configurations are not necessarily
> sufficient, and that higher quality implementations may better support use
> cases considered important.
>
>
>  What is the goal here?
>>
>
> To ensure that those reading the spec (e.g. implementors, people who are
> basing purchase requirements on it, and people who are using it as a source
> for teaching others) are aware that some use cases are likely to require in
> practice the "higher quality" implementations that offer configurations
> beyond the minimum requirement listed for conformance to the specification.
>
>
>  How does this actually improve accessibility?
>>
>
> By avoiding the mistake people often fall into of *assuming* that
> following the spec equates to getting everything they want. Setting the bar
> for conformance to the specification so low that it automatically excludes
> certain legitimate use cases (I assume you are not arguing that the use
> cases are not legitimate, since you haven't so far suggested that) suggests
> that a responsible thing to do is to clarify in the spec that this is the
> case.
>
>
>  Indeed, the simplest method of satisfying it I can think of is adding a
>>> note on the addSourceBuffer method, after the definition of minimum
>>> requirements, pointing out that for some use cases, including
>>> accessibility-related ones such as signed video captioning, additional
>>> capability is necessary.
>>>
>>
>> While I agree that this is the simplest and the likely path for
>> consensus, I worry that "additional capability is necessary" is not
>> really helpful to the reader. There are no references to specs that
>> indicate what additional capability is actually needed.
>>
>
> I expect the editors, being very familiar with the specification, could
> help us improve any suggestion to make sure it is true and includes any
> necessary pointers or qualifications.
>
>
>  Perhaps this could be outlined in extension specs to MSE, but for
>> now, I don't really see how this improves things.
>>
>
> I don't think an extension specification to MSE is necessary, and I think
> there is good reason to believe that resolving the discussion that way will
> lead to a very expensive and poor quality outcome.
>
>
>   I think people are reading too much into the 1 audio track and 1 video
>>>
>>>> track requirements. The primary purpose of these 2 bullet points was
>>>> to make sure that both "multiple tracks per SourceBuffer" and "multiple
>>>> SourceBuffers with a single track" must be supported by implementations.
>>>> The 1 track requirement is simply a reflection of the fact that many
>>>> devices will only be able to support these 2 configurations.
>>>>
>>>
> For some value of "many".
>
> Rather than arguing whether such essentially legacy devices conform to the
> requirements for a modern web, we're not trying to change the requirement.
>
> But it is one thing to suggest that Opera Mini is a useful tool for
> accessing the Web (it is, for zillions of users) and another to say "so it
> does everything you need on the web, with a feature phone" (which might
> appear in press releases, but is demonstrably misleading in normal
> parlance).
>
> We're asking for the specification to err on the side of being a
> technically clear document, rather than a PR piece.
>
> At the same time, we are asking to work with the editors and Working Group
> to ensure that relevant statements describing limitations introduced or
> implied by the spec are indeed accurate.
>
>
>  Obviously if a UA was able to support sign language video tracks, then
>>>> they would go beyond the minimal requirements.
>>>>
>>>
>>> It is not a priori obvious that a conforming implementation of a W3C
>>> Recommendation is unable to support basic use cases for accessibility.
>>> It is certainly not the general message that W3C promotes with regard to
>>> its specifications.
>>>
>>
>> I feel like MSE is being held to a higher standard here just because it
>> expresses a reality that HTML5 doesn't fess up to.
>>
>
> I don't think so. And as I note, if I am mistaken and HTML5 really doesn't
> support the use cases I don't think you should assume that it won't be held
> to the same standard.
>
>
>  What if I modified the text so it ignored this reality and only said
>> something along the lines of:
>>  "If an implementation supports a specific combination of N tracks in a
>> single SourceBuffer, then it also must support the same N tracks
>> distributed across M SourceBuffers where M > 1 and M <= N."
>>
>
> To be honest, my likely reaction to such a change would be to become
> somewhat less confident about your willingness to look collaboratively for
> sensible solutions to identified problems which you profess to want to
> solve.
>
>
>  Would the need for the note go away?
>>
>
> I don't think so. Although it makes the spec far harder for anyone to
> understand, with no apparent benefit to anyone, it *seems* to me on first
> read that it effectively doesn't support the use case, nor acknowledge that
> the use case is unsupported.
>
>
>  This would bring MSE into equal vagueness with HTML5 on this issue
>> and would not exclude your use cases.
>>
>
> As stated earlier, I believe HTML5 does actually support the use cases,
> and that this assertion would therefore be untrue.
>
>
>  I would prefer making this normative change instead of an informative
>> change that I believe would have little impact on making content more
>> accessible.
>>
>
> I hope you decide that retreating into obscurity is not a good approach to
> dealing with issues.
>
> I do not want to insist on a particular text that is less than honest, nor
> one that is sufficiently unclear that it is effectively misleading. I want
> to work constructively to help clarify to users of the MSE specification
> what is required to support use cases that I believe it and HTML do
> actually support.
>
> [mg] http://www.w3.org/html/wg/drafts/html/CR/embedded-content-0.html#synchronising-multiple-media-elements
> [w3school] http://www.w3schools.com/TAGS/av_prop_mediagroup.asp
>
>
> cheers
>
> Chaals
>
> --
> Charles McCathie Nevile - Consultant (web standards) CTO Office, Yandex
>       chaals@yandex-team.ru         Find more at http://yandex.com
>
Received on Tuesday, 17 December 2013 17:22:07 UTC