[minutes] telecon 7 April 2011 from Dan Burnett on 2011-04-08 (public-xg-htmlspeech@w3.org from April 2011)

From: Dan Burnett <dburnett@voxeo.com>
Date: Fri, 8 Apr 2011 05:59:15 -0400
To: public-xg-htmlspeech@w3.org
Message-Id: <557619A5-FF33-4D43-83D0-D46942A21417@voxeo.com>
The minutes are available at http://www.w3.org/2011/04/07-htmlspeech-minutes.html 
.

For convenience, a text version is included below.

Thanks to Debbie for taking minutes!

-- dan


Attendees

    Present
           Dan_Burnett, Olli_Pettay, Milan_Young, Michael_Bodell,
           +1.818.237.aaaa, Raj_Tumuluri, Patrick_Ehlen,
           +1.425.421.aabb, Jerry_Carter, +1.425.391.aacc, Robert_Brown,
           Dan_Druta, Debbie_Dahl, Bjorn_Bringert, Michael_Johnston

    Regrets
    Chair
           Dan Burnett

    Scribe
           Debbie Dahl

Contents

      * Topics
          1. f2f logistics and planning
          2. open questions about proposals
      * Summary of Action Items
      _________________________________________________________


f2f logistics and planning

    bjorn: several people have asked for rooms
    ... is there anyone else?

    dan: I will need a room

    bjorn: I need the maximum number of days that you'll stay there.
    ... is anyone opposed to a better hotel, costs 7 GBP more?
    ... i will switch us to a better hotel
    ... will send out a form to see how many are coming

    dan_druta: will come and let you know.

    raj: will come

    bjorn: nothing else about arrangements

open questions about proposals

    danB: new person

    patrick: Patrick Ehlen from ATT

    dan: for each proposal would like to hear a quick summary of what
    your proposal does and doesn't do with respect to the other
    proposals.
    ... proposers should just take the floor and discuss, even if other
    proposers may want to make a correction.
    ... bjorn starts.

    bjorn: MS proposal, there aren't a lot of commonalities between ASR
    and TTS.
    ... is that correct?

    danB: will discuss later

    bjorn: MS includes both a Javascript API and a browser-server
    protocol
    ... would like to break these apart
    ... sums up MS proposal, but thinks that MS API and Google proposal
    could be merged.
    ... Mozilla proposal is similar to Google, but Mozilla doesn't allow
    user-initiated recognition without a permission prompt, but Google
    does, and this is an important use case for us.
    ... the proposal for the WebApp API could say what implementation is
    used, and there could be a different proposal for how the browser
    talks to that implementation.

    olli: how does Google's proposal do that without click-checking?

    bjorn: the browser must make it clear that it's starting recognition

    dan: click jacking or click checking?

    bjorn: should be click jacking, not checking
    ... there should also be click checking to make sure that it was
    really the user.

    dan: switch to olli's discussion now.

    olli: the differences are minor. wasn't thinking much about the
    network engine.
    ... about Google's proposal it seems that it would be difficult to
    handle multiple fields at once
    ... that's one reason why X+V was so difficult
    ... wouldn't like to bind recognition results to one input field
    ... also, user-initiated recognition, i don't see the difference if
    the user is clicking something and that starts recognition, that
    could be ok at first, so I don't see the difference between Mozilla's
    and Google's proposals.
    ... MS proposal using Web Sockets is minor but could be good if we
    want to allow remote speech engines

    milan: question for Olli, you said that we must handle
    click-jacking?

    <burn> that was Milan

    olli: not sure how Google's proposal handles this

    milan: in summary, you don't find that sufficient?

    olli: no

    robert: I agree that if you look at high level scripting API the
    proposals are similar.
    ... the high level speech semantics are very similar and we should
    be able to converge pretty easily.
    ... there are only so many ways to build a speech API
    ... one of the things that we're trying to achieve is to allow a lot
    of openness so that the ASR and TTS are not determined by the
    manufacturer of the browser.
    ... one thing I'm concerned about with Google and Mozilla is that
    there's an intent to handle that later, but I think it needs to be
    handled now. we need in the first version to handle some
    interoperability.
    ... what could we do to provide a simple protocol with existing
    API's. we proposed XHR, but Web Sockets would be fine. we wanted to
    say that it's not a hard problem.
    ... the second comment is that we tried to take a scenario-focused
    approach. our document specified a half-dozen or so apps, and tried
    to think about requirements.
    ... this is why we put ASR and TTS into the same spec
    ... there are a number of scenarios that would be difficult if the
    speech was just built into the browser
    ... a comment about user-initiated speech. we're skeptical that just
    having a button that the user pushes ensures privacy.
    ... there will be many kinds of devices, we believe that consent
    should be built into the browser implementation.
    ... we don't want the speech API to be a de facto microphone API, we
    should provide microphone requirements to an existing effort.
    ... on the question of v1 vs v2. we aren't opposed to a second
    version, but we don't want v1 just to be the easy things, it should
    include the important things.
    ... regarding TTS, it takes the things that seem to work from the
    media element, but not the things that don't apply, like multiple
    tracks.

    dan: questions for Robert?

    milan: robert, how does your proposal handle a default recognizer?

    robert: you use a constructor without that parameter, if there are
    multiple recognizers available you could use those parameters to
    select an appropriate one.
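
    A minimal sketch of the selection behaviour Robert describes, for
    illustration only. Every name here (SpeechRecognizerSketch,
    RecognizerCriteria, the engine URIs) is hypothetical and is not
    taken from the MS, Google or Mozilla proposals. It assumes only
    that a constructor called without parameters yields the browser
    default, and that parameters narrow the choice among the available
    recognizers.

```typescript
// Hypothetical types: not part of any proposal discussed on the call.
interface RecognizerInfo {
  uri: string;         // assumed identifier for an engine
  languages: string[]; // languages the engine claims to support
}

interface RecognizerCriteria {
  uri?: string;
  language?: string;
}

// The engine an imaginary browser would use when nothing is specified.
const defaultRecognizer: RecognizerInfo = {
  uri: "builtin:default",
  languages: ["en-US"],
};

// All engines the imaginary browser can reach, default first.
const availableRecognizers: RecognizerInfo[] = [
  defaultRecognizer,
  { uri: "https://speech.example.com/asr", languages: ["en-US", "de-DE"] },
];

// No criteria: return the default. Otherwise return the first engine
// that satisfies every criterion that was supplied.
function selectRecognizer(criteria?: RecognizerCriteria): RecognizerInfo {
  if (criteria === undefined) {
    return defaultRecognizer;
  }
  for (const candidate of availableRecognizers) {
    const uriOk = criteria.uri === undefined || candidate.uri === criteria.uri;
    const languageOk =
      criteria.language === undefined ||
      candidate.languages.indexOf(criteria.language) !== -1;
    if (uriOk && languageOk) {
      return candidate;
    }
  }
  throw new Error("no recognizer matches the requested parameters");
}

class SpeechRecognizerSketch {
  readonly engine: RecognizerInfo;

  constructor(criteria?: RecognizerCriteria) {
    this.engine = selectRecognizer(criteria);
  }
}

// Usage: the first call gets the default engine, the second asks for German.
const byDefault = new SpeechRecognizerSketch();
const german = new SpeechRecognizerSketch({ language: "de-DE" });
```

    What should happen when no engine matches is not something the
    discussion settles; throwing an error is just one option.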

    danB: this discussion will be more unstructured and open. next week
    we'll have a more structured discussion. first bjorn will get a
    chance to respond.

    bjorn: regarding olli's point about multiple input fields, you could
    do that with scripting, or we could use something like MS.
    ... (missed comment about random selection)
    ... on the question of whether clicking implies consent, we say that
    clicking could imply consent, but there could be other ways. Also
    agree that other engines could be used, but one way to do that would
    be, for example, Nuance, to write a plugin.
    ... you could have a Javascript API with a parameter that says which
    engine would be used.
    ... would like some clarification on what use cases couldn't be
    supported by default recognizer
    ... we agree that we don't want to work on microphone API

    dan: the floor is open. question for bjorn about the click-to-speak
    issue
    ... there could be a button to click but that doesn't necessarily
    imply consent. it is still the browser's responsibility to ensure
    consent.

    bjorn: a button could ensure consent.

    dan: the browser could even treat lack of clicking as consent for
    some use cases.

    raj: another use case for not using the default recognizer might be
    if you have an SLM, since SLMs aren't interoperable.

    danB: does Google.com want the default recognizer in IE to be the MS
    recognizer?
    ... individual sites may have a strong preference for a recognizer
    to be used.

    robert: for example, Nuance have a lot of enterprise customer care
    speech applications, and customers will want to leverage that
    investment.

    danD: if the web developer wants to specify an engine they should be
    able to do that. the browser should provide a default. also the user
    should be able to specify a recognizer.

    danB: if the user has asked for another recognizer, then web
    application should be able to not render.

    robert: we've already agreed to this

    bjorn: doesn't disagree

    jerry: what about local resources?

    bjorn: everyone agrees on that

    jerry: many free-form grammars would only work with certain engines

    bjorn: we have broad agreement. with MS proposal we could split
    control of recognizer from selection of recognizer.

    robert: in principle that would be reasonable, but don't want to
    lose track of one of those topics.

    danB: what we do with TTS and ASR should be synchronized.
    ... some use cases only involve TTS, for example.

    <Milan> Milan: reluctant to split the solution (tts, asr, protocol)
    into many documents because vendors may choose to implement only
    select pieces

    bjorn: two different things, ASR vs. TTS and web app vs. server
    ... does anyone have concerns about splitting?

    milan: only that browsers might cherry-pick specs

    robert: would not ratify one spec if the other wasn't satisfactory

    michaelB: if they were together it would be easier to keep things in
    synch.

    bjorn: the web app api could be done, and then the server-side one
    could depend on that.

    michael: the questions about synch and ratifying at the same time
    argue for one proposal.

    danD: if we had two efforts it would speed up adoption but it still
    should be one spec.

    bjorn: there should be a single API for the web app and another one
    for how the browser talks to the engine.

    milan: I had a proposal for a way to unify the Mozilla proposal and
    MS proposal by using macros over the MS proposal to make it look
    more like Mozilla.

    bjorn: that's mostly syntactic

    <burn> Milan's email (thread):
    http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Mar/0040.html

    milan: the MS proposal talks to the server in the web app

    danD: we should move away from syntax to more declarative types of
    statements.

    robert: not sure, for example, in SALT, html elements were simple,
    but then you had to write a lot of scripting

    bjorn: this is a different discussion

    michaelJ: it's important to keep specification of API to the server
    because there will be a lot of overlap between the API's. we don't
    want to end up with different names for things.

    milan: but if we had separate specifications, they should be ratified
    together.

    bjorn: that seems reasonable, and they should be developed in
    parallel
    ... they could be separate so that people who write webapps only
    have to look at one thing. Also, they could go to different
    standards organizations.

    danB: at IETF, talked about real-time collaboration between web
    browsers (RTC web). they won't be working on new protocols. the
    interface from browser to engine will introduce some requirements.

    bjorn: there could be several protocols for talking to servers, so
    that more could be added later
    ... for example, VoiceXML and SRGS aren't in the same spec

    milan: we agreed that there should be a protocol for communicating
    to a speech server.

    danB: there could be a "mandatory to implement" requirement. any web
    app API is not complete unless it includes a "mandatory to
    implement" requirement for server communication that is defined by
    this group.
    ... we should begin to do this because our requirements are
    different.

    robert: VoiceXML/SRGS analogy is different because SRGS can be used
    independently. both are tightly coupled.

    bjorn: web app API makes sense by itself and also server API
    ... we are implementing both of those at Google. We have non-browser
    clients that use the server API

    milan: how about an MRCP over HTTP protocol?
    ... are people familiar with MRCP?

    bjorn: seems a lot more complex than MS proposal

    milan: MS is a simplified version of MRCP

    robert: that is kind of what we've done, could also do MRCP over Web
    Sockets.

    raj: MRCP is a good idea, because it's already been implemented, but
    wouldn't it be overkill for a local system?

    milan: most OS's would optimize that

    bjorn: it's more than just efficiency

    milan: talking about using MRCP paradigm, not full MRCP

    dan: MRCP is a protocol that just controls ASR and TTS resources.
    MRCP v2 makes use of SIP to set up an MRCP session, but from then
    on all communication is MRCP. milan is talking about the MRCP
    protocol itself, which doesn't require SIP.

    milan: would be willing to stage this.

    jerry: MRCP in the browser is very messy.

    robert: could we layer MRCP over Web Sockets

    milan: i'm not suggesting that developers would program to MRCP, in
    a web app you would have to have simpler concepts, or the browser
    could support it, which would totally mask it from the developer.

    danD: it should definitely be abstracted from the web browser
    ... it will enable both weekend and enterprise developers to use the
    spec

    dan: any other topics that require discussion?

    danD: the proposals lack clarity around privacy, preferences and
    consent.

    bjorn: they should be up to user agents

    danD: we need to put some mandates on the developers of user agents.

    robert: for example, a way to indicate to the recognizer that it
    shouldn't log?

    danD: yes, should have a very clear indication of what the user can
    specify or override in regard to the speech interaction

    robert: it depends highly on the user agent itself. a cell phone is
    different from the dashboard of a car or one that's being used by a
    blind person. I don't feel comfortable mandating something.

    danD: for example, where do we display what engine is being used? do
    we want to have a consistent way for the user to specify their
    profile?

    danB: it seems clear that we must address this topic in a
    specification

    bjorn: about protocols vs. web api's.

    <Robert> we also need to discuss microphone API

    bjorn: there was discussion about protocols, but we didn't talk much
    about web api's. we pretty much agree on web api's.

    michaelB: not sure about the details.

    michaelJ: agree, details need to be worked out.

    bjorn: yes, but we seem to agree on high level.

    robert: we seem to be moving in the direction of a JavaScript API,
    although not in HTML, or the protocol.

    bjorn: if we start on the web api, there are a lot of things we
    could agree on.

    danB: major issues need to be worked out early during process, but
    it's also good to be able to make progress. so we need to be able to
    do both at the same time.
    ... that is, discuss big issues and work out details of things we
    roughly agree on.

    michaelB: agree, this is a reason it's useful to have things in the
    same document.

    danB: Michael and I will talk about how to structure discussion,
    e.g. write down things we agree on.
    ... it might be too soon to work out details of proposals.

    robert: one thing we don't agree on is microphone api.

    milan: also result format

    <smaug_> (we may not need to think about microphone if we move to
    use audio streams)
Received on Friday, 8 April 2011 09:59:45 UTC