
[minutes] 5 May 2011

From: Dan Burnett <dburnett@voxeo.com>
Date: Wed, 11 May 2011 06:05:23 -0400
Message-Id: <C2F8EA17-CDE3-4C33-ADE9-31B496AC4331@voxeo.com>
To: public-xg-htmlspeech@w3.org
Group,

The minutes are available at http://www.w3.org/2011/05/05-htmlspeech-minutes.html

For convenience, a text version is below.

Thanks to Charles Hemphill for taking minutes!

-- dan

Attendees

    Present
           Dan_Burnett, Michael_Bodell, Bjorn_Bringert, Robert_Brown,
           Olli_Pettay, Charles_Hemphill, Patrick_Ehlen, Dan_Druta,
           Michael_Johnston, Raj_Tumuluri

    Regrets
           Debbie_Dahl, Marc_Schroeder

    Chair
           Dan_Burnett

    Scribe
           Charles_Hemphill

Contents

      * [4]Topics
          1. [5]F2F Logistics: Any updates on attendance, hotel
             bookings, and questions or details from Bjorn.
          2. [6]Review new text in updated "Final Report" document to
             ensure it matches what people think we agreed upon in our
             last teleconference.
          3. [7]Determine if we already have other agreed-upon design
             decisions.
          4. [8]Begin discussing issues listed in the Appendix.
      * [9]Summary of Action Items
      _________________________________________________________

    <burn> trackbot, start telcon

    <trackbot> Date: 05 May 2011

    <burn> Scribe: Charles_Hemphill

    <burn> ScribeNick: Charles

    <burn> Agenda:
    [10]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011May
    /0001.html

      [10] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011May/0001.html

F2F Logistics: Any updates on attendance, hotel bookings, and questions
or details from Bjorn.

    Bjorn: no updates on F2F

    Burn: will send out schedule in the next few days.

Review new text in updated "Final Report" document to ensure it
matches what people think we agreed upon in our last teleconference.

    <burn> document is
    [11]http://www.w3.org/2005/Incubator/htmlspeech/live/NOTE-htmlspeech
    -20110503.html

      [11] http://www.w3.org/2005/Incubator/htmlspeech/live/NOTE-htmlspeech-20110503.html

    Burn: comments on the document - added general design decision - 17
    new discussion bullets.

Determine if we already have other agreed-upon design decisions.

    Bjorn: discussion topic about mic capture access. Propose design
    agreement - should be possible to start speech reco without
    selecting mic - just pick default.

    Burn: default vs. what you can do - two things.

    Bjorn: There should be a default mic. Perhaps the only option.

    Born: saying explicit determination of mic should not be required.

    Bjorn: Should not need to enumerate mics before starting.

    Robert: Think we let you pick other mics.
    ... that's a reasonable interpretation.
    ... By default, mic provided by user agent default device.

    Bjorn: Need to discuss second sentence later - picking a mic.
    ... should be able to start reco without selecting mic - confirming
    agreement.

    Robert: Assuming that the default will be used for mic.

    Burn: notion of default mic.

    Robert: Issue of user interface. Shows speaker activity. Is there a
    default user interface? Can the application override.

    Bjorn: Have that requirement for default user interface.

    Robert: RE: default user interface - shows it's listening and lets
    user cancel.

    Olli: What is the default user interface. Something in the browser.

    Bjorn: Should only use browser user interface. No Web app user
    interface.

    Olli: More security or privacy concerns otherwise.

    DanD: worried about limitations of only in the browser.

    Robert: Don't think that's true.
    ... Default user interface. Can it be overridden. Where does it
    live. 3 discussions.
    ... Google right in the Web page where the user clicks. Up to user
    agent to decide how to render.

    Bjorn: Have a default interface now.

    MichaelJ: Fine for default. Want APIs to allow someone to build
    their own. Different user experience. Allow this. Useful to have
    default. But not always appropriate.

    Bjorn: Have agreement on default. Have disagreement on your own due
    to security reasons, etc.

    MichaelJ: Very limiting otherwise.

    Bjorn: Should start speech by custom ways including JavaScript. Can
    hide that you're capturing audio if custom UI.

    Robert: Compromise - default UI parameterized? Provide feedback to
    the user. Style sheet. Look at customizations.

    MichaelB: Up to user agent to allow customization. Part of
    permissions API.

    Burn: Should be a default user interface.
    ... Should there be customization and what level.

    DanD: Not all use cases in browsers. Different security concerns if
    rendering engine used. Should not be forced by HTML spec to have a
    particular UI.
    ... Don't want to prevent animated character app that is listening
    to you.

    Bjorn: Talk about browser case. Need to be clear that the browser is
    capturing the audio.

    DanD: Could be a matter of security settings.

    Bjorn: Don't say that we disallow customization, but don't require
    this.

    DanD: End up with fragmentation. Won't work cross browser.

    Bjorn: Allow for non-browser apps.
    ... Note for future discussion.
    ... Allow customization of the user interface that shows audio
    capture is happening.

    Burn: Have a discussion topic of the level of customization allowed.

    Bjorn: Should have customization for the UI for starting
    recognition. Have discussion topic: customize UI for showing that
    audio is being captured.

    MichaelJ: Waveform, traffic lights?

    Bjorn: Can app customize what the app looks like?

    MichaelJ: Can customize one that shows up in the UI.
    ... Multimodal tap and talk API. Want creativity. Activate
    recognition button. Don't want to rule out certain kinds of APIs.
    Don't want built-in browser feedback to interfere.

    Burn: come back to this discussion later.

Begin discussing issues listed in the Appendix.

    Burn: Have time to discuss a serious topic. Can work out serious
    issues at FTF.
    ... Determine which topics have more meat. Start with audio.
    ... 3 audio related topics. How to get audio capture access.
    Mandatory audio codecs. Audio streaming support and how.

    Bjorn: 1st unrelated to 2nd two. 1st is API. 2nd two how audio is
    sent from browser to implementation.

    Burn: How to get audio mic capture access.

    Bjorn: MS proposal has mic selection. What are use cases?

    <burn> "audio mic capture" is "audio/mic/capture"

    Robert: Browser going to have mic API anyway. Avoid 2 mic APIs. 1 in
    speech and another unrelated (explicit). Want speech API to integrate
    with browser API.
    ... Many devices will have mult. mics. Important to select the one
    you want. Maybe app or user through preferences.
    ... May want to configure mic settings. Use for things other than
    speech. E.g. video app that does speech reco.
    ... MS API allows this. Can get audio stream to reco. Look at
    multimodal scenarios. Need for integrated API there. Speech API
    should integrate.

    Bjorn: Can buy most of that.
    ... If there is one there, should be able to use for speech. But no
    such standard API yet.

    Robert: Pushing capture API heavily. With Michael. IE team thinks
    this is a sound approach.

    Burn: Agree ability to select diff. audio sources.

    Robert: Not quite it. If browser has mic API - we should be able to
    use it.

    Bjorn: Agree. But if not one, don't want to come up with one
    ourself.

    Olli: agree.

    Bjorn: If HTML standard has one, we should be able to use it.

    Robert: Fine with HTML rather than browser.

    Burn: Meta decision. Use HTML if exists, but not create one.

    Robert: Have requirements for such an API?
    ... Latest draft doesn't have notion of stream of endpointing. And
    we care deeply about these for mic API.

    Bjorn: Why does mic API need endpointing?

    Robert: Can be a long way between mic and endpointer.

    <burn> should "stream of endpointing" be "stream or endpointing"?

    Bjorn: Requirement that endpointing be available for things other
    than speech.

    Michael: Hopefully, have agreement - will work with people designing
    the API and express requirements.

    Bjorn: Seems fair.

    Olli: Capture API in HTML draft or draft working group.

    Robert: Mean the one in the DAP working group.

    Bjorn: Think we should work with HTML.

    Burn: 2nd one tricky. Wrote we will capture an express requirement
    on a capture API to relevant groups.

    Bjorn: Seems reasonable. Avoid "capture".

    Burn: requirements on audio capture APIs.
    ... requirements on all audio capture APIs.

    Bjorn: seems fine.

    <mbodell> Olli, is there a capture API in the w3c HTML draft? I
    don't see it at [12]http://dev.w3.org/html5/spec/Overview.html

      [12] http://dev.w3.org/html5/spec/Overview.html

    <smaug> mbodell: I don't read that version of HTML spec ;)

    Bjorn: If no HTML audio capture API. Propose that we proceed even
    without a mic API.

    <smaug> mbodell:
    [13]http://www.whatwg.org/specs/web-apps/current-work/multipage/dnd.
    html#video-conferencing-and-peer-to-peer-communication is an early
    draft

      [13] http://www.whatwg.org/specs/web-apps/current-work/multipage/dnd.html#video-conferencing-and-peer-to-peer-communication

    <burn> (now Robert is speaking)

    Robert: Concern - browsers will need to implement privacy and
    security policies. Weird to have for speech alone, but not audio
    capture in general. May be messy.

    Bjorn: Forge ahead, and consider audio capture in general.

    Burn: Agreement that's important.

    Bjorn: Having control over audio capture does not have to be in the
    first proposal.

    Burn: Is that the consensus?

    Bjorn: OK to have speech API if there is not an audio capture API.

    Robert: Not create one, and shouldn't be blocked from moving
    forward.

    Burn: Not create one and not block while waiting for one.

    Michael: May design suboptimal if no audio capture API and may not
    fit well once it's there.
    ... Premature to jump to say we can make total progress without
    that.

    DanD: Goal for group to submit the requirements to the other working
    groups. Accelerating the capture API for audio may be one of the
    recommendations. AT&T member of DAP. Recognize needs.

    Bjorn: Agree we should not block this progress while waiting.

    DanD: May create fragmentation.
    ... Unless abstracted completely to "get mic".

    Bjorn: Agreed that we should start reco without specifying mic.

    DanD: Concerned that we should avoid fragmentation.

    Burn: Good to get agreement.

    DanD: API for capture, if we are able to capture the audio without
    web developer going through coding, then we are fine.
    ... If anything specific in the web application to retrieve the
    audio handle, then we're looking for if-then-else statements.

    Bjorn: We would like to do the former.

    Burn: What is meant by "start of speech", "end of speech", and
    endpointing in general? How do transmission delays affect the
    definitions and what we want in terms of APIs?

    Robert: Divide into smaller topics. Distributed env., with speech
    services remote. 2 notions of endpointing: by reco or cheap on client
    (responsiveness and reduced network IO). Look at these 2 as
    separate.

    Bjorn: Throw out proposal. Require client-side simple endpointer?

    Robert: Has my vote.

    Burn: No endpointer on my computer.

    Bjorn: Browser could do simple energy-based end pointing.
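
    As an illustration of the simple energy-based endpointing mentioned
    here, the following is a minimal sketch and not a proposal from the
    group. The function names, the -30 dB threshold and the frame counts
    are assumptions chosen for the example; a real user agent could use
    any detector it likes.

```python
import math
from typing import List, Tuple


def frame_energy_db(frame: List[float]) -> float:
    """Mean energy of one frame of samples in dB, floored at -100 dB."""
    if not frame:
        return -100.0
    mean_square = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(mean_square) if mean_square > 1e-10 else -100.0


def detect_endpoints(
    frames: List[List[float]],
    threshold_db: float = -30.0,  # assumed value, needs calibration in practice
    start_frames: int = 3,        # loud frames required before reporting a start
    end_frames: int = 5,          # quiet frames required before reporting an end
) -> List[Tuple[str, int]]:
    """Return ("start", index) and ("end", index) events for a frame sequence."""
    events: List[Tuple[str, int]] = []
    in_speech = False
    run = 0  # consecutive frames that contradict the current state
    for i, frame in enumerate(frames):
        loud = frame_energy_db(frame) >= threshold_db
        run = run + 1 if loud != in_speech else 0
        if not in_speech and run >= start_frames:
            # Report the first frame of the loud run as the start.
            events.append(("start", i - start_frames + 1))
            in_speech, run = True, 0
        elif in_speech and run >= end_frames:
            # Report the first frame of the quiet run as the end.
            events.append(("end", i - end_frames + 1))
            in_speech, run = False, 0
    return events
```

    The two frame counts trade latency against false triggers, which is
    the same tension raised in the discussion about detector quality.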

    Robert: Lots of options. GSM encoder has endpointer. Can have local
    reco and use for endpointer.

    Burn: API needs to assume client as well as server-side endpointer.
    client could be null op?

    Bjorn: Stronger: has to be something in the client that does tell
    start and end of speech. even if not good.

    Michael: Can see recommending. Don't know how web author can know.
    requirement is low latency. doesn't matter after that.

    Bjorn: Agree with that. But if app points to specific recognizer,
    can interact.

    Burn: Why concerned. Reco can get finicky about input based on
    training. Endpointing is mostly done in advance. Be careful about
    requiring local endpointing. If bad, can affect reco.

    Bjorn: Avoid bad endpointers.
    ... Low latency speech detection should always be available.

    MichaelJ: But not forced to use it. FedEx example: some query -
    using endpointing from reco - want them to be able to use the
    standard. Client endpointing could cause errors.

    Bjorn: Have some parameters. Make it easier for the app. Think
    you're speaking.

    Burn: Ongoing recognition case - won't use local endpointer.
    ... plenty of open mic apps - listen for keywords.

    Bjorn: Should be one, but should be possible for app to turn off.

    Robert: probably want app to turn it on if it needs it.

    Michael: Set a parameter and get it that way.

    Bjorn: Hello world app.

    Charles: Level for feedback - good to be local.

    Burn: Low latency endpoint detector should be available.

    Bjorn: Don't have agreement if on or off by default.

    MichaelJ: Talking about detection of end of speech or start too?

    Burn: may be big difference.
    ... Want low latency to turn on speech to reco - but don't want it
    to stop.

    Bjorn: we do the opposite.
    ... Start streaming right away, server endpoints, but need to stop
    streaming at some point.

    Robert: very scenario dependent. Need start stop speech event. Start
    when click of button, end matters a lot. Need to have options
    available.

    Burn: Forwarding audio to expensive recognizers. Want high accuracy
    on end pointing. Don't want to send audio unless we have to due to
    expense.

    Bjorn: Cutting off audio vs. endpointer. Can not listen for the
    event. Control if endpointing cuts off audio.

    MichaelJ: Need to control when start sending audio to recognizer.

    Burn: Start speech and reco can be different.

    MichaelJ: If reco on for a long time, may want to do something to
    delay until there is certainty of speech.

    Bjorn: Agree that a low latency endpointer is available.
    Should be possible for app to decide if audio is started or stopped
    on endpointer.

    Burn: Audio start/stop separate from speech start/stop. Separately
    controllable.
    ... Detector detects both start/end of speech and fires an event in
    each case.

    Bjorn: Separate issue of cutting off audio.

    Burn: Audio to the reco process as opposed to TTS.
    ... Audio start and stop to reco server (resource)...

    Bjorn: Control over which audio is used for speech recognition.
    ... which part of the captured audio.
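
    The two policies discussed above (hold audio back until speech is
    detected, or stream right away and cut off once speech ends) could
    be modeled as two independent switches. This is a hypothetical
    sketch only; the function and parameter names are invented for
    illustration and do not come from any proposal before the group.

```python
from typing import List


def frames_for_recognizer(
    is_speech: List[bool],  # per-frame decision from a local endpointer
    gate_on_start: bool,    # hold audio back until speech is detected
    cut_on_end: bool,       # stop sending once detected speech has ended
) -> List[int]:
    """Return the indices of captured frames forwarded to the recognizer."""
    forwarded: List[int] = []
    seen_speech = False
    for i, speech in enumerate(is_speech):
        if speech:
            seen_speech = True
        elif seen_speech and cut_on_end:
            break  # speech has ended and the app asked for a cutoff
        if seen_speech or not gate_on_start:
            forwarded.append(i)
    return forwarded
```

    With both switches off, every captured frame is forwarded and the
    local endpointer only produces events, which matches the open-mic
    case raised earlier.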

    DanD: Make sure we carefully agree that we are not forcing the
    application into using the predefined environment engine of the
    browser and still allow developer to choose which engine to use.

    DanD: have a flag. If use optimized endpointing in application or
    not.

    Bjorn: Separate from how you choose the engine.

    MichaelJ: Related - if turned on, give some sort of event for local
    prediction of begin/end of speech, is that the resolution we want?
    If level detector, can also get level?

    Bjorn: Should be a more precise way to get actual events from
    recognizer. Level part of mic API?

    MichaelJ: Could be raw energy detector, limited reco listening for
    "silence", etc. for the local part. The browser, client side, can
    have best that it can. Not saying anything about how it's done.

    Burn: May be a difference when there are multiple endpointers. (1)
    low latency - prefilter to decide if goes to reco, (2) high quality
    in engine.
    ... Would want recognizer's endpoint detector. But preprocess one is
    the low latency one.

    Bjorn: 2 events: 1 probable vs. actual start/end of speech.

    MichaelJ: Talking now vs. not. More going on underneath. Get
    complicated to expose underneath if varies by implementation. Energy
    level might drive aspects of the API.

    Burn: Why want distinction? Mic open is one option. Another is that
    engine is paying attention. Another is that engine found something
    important.
    ... Might decide that it's not hearing anything.

    Bjorn: Started capture, think starting, actually starting. 1st 2 go
    in the UI. Good to have last for timing.

    Burn: In VXML2, have hot word detection. Concluded it doesn't act as
    if speech is detected until something happens. Acts as if nothing
    happened if nothing reco'd. May collapse 2nd and 3rd states.

    Bjorn: Thought we had agreement earlier.

    Burn: Agreed we had some sort of start and end. Knew we needed to
    discuss it.

    Bjorn: 3.3.3. - onspeechstart/end/error. Need to add more to this
    list.
    ... propose adding onaudiostart onaudioend, and split onspeechstart
    to detected vs. actual (reco).

    MichaelJ: energy vs. reco? split

    Bjorn: Could be confusing.

    MichaelB: onsoundstart?

    Bjorn: sounds like a good name.

    MichaelJ: Issues of calibration? Sensitivity parameters? Used on
    mobile phones or elsewhere. Might need calibration to work well.

    ???: Sensitivity and timeout parameters.

    Burn: Whole topic to discuss parameters.

    Bjorn: Discuss parameters in context.
    ... Agree on adding these events?

    Burn: We will add onaudiostart/end ... Dan will cut and paste here?

    Bjorn: onsoundstart/end should be low latency. Also say something
    about order.

    Burn: OK. audiostart, soundstart, speechstart, speechend, soundend,
    audioend

    Bjorn: Might not get soundstart or speechstart.

    Burn: onsoundstart, require soundend.
    ... soundend optional?

    Bjorn: not true.
    ... Can't have onspeechstart without the preceding two.

    Charles: Want end events with start events.

    Burn: Can have ends all at the same time.
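
    One possible reading of the ordering constraints discussed so far
    can be written as a small checker. It is a sketch under stated
    assumptions: events nest as audio, then sound, then speech; inner
    levels are optional; every start needs a matching end; an inner
    level may open again before the outer one closes. The last point
    and the helper name are assumptions, and error handling is left
    out because it has not been settled.

```python
from typing import List

# Nesting order from outermost to innermost, as listed in the discussion.
LEVELS = ("audio", "sound", "speech")


def is_valid_sequence(events: List[str]) -> bool:
    """Check that start/end events are properly nested and all closed."""
    depth = 0  # number of levels currently open
    for event in events:
        if depth < len(LEVELS) and event == LEVELS[depth] + "start":
            depth += 1  # opening the next inner level
        elif depth > 0 and event == LEVELS[depth - 1] + "end":
            depth -= 1  # closing the innermost open level
        else:
            return False  # out of order, unmatched, or unknown event
    return depth == 0
```

    Under this reading a session with no detected sound is just
    audiostart followed by audioend, and speechstart can never appear
    without soundstart before it.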

    Bjorn: what if onerror?

    Burn: Great topic.
    ... capture that as issue for discussion.
    ... what happens to audio, sound and speech events in case of error.

    Bjorn: Also sensitivity discussion point. And timeout parameters for
    ASR.

    Burn: Meeting next week. Can have call after that. Meeting after
    that. 2 days of meeting.
Received on Wednesday, 11 May 2011 10:05:52 GMT
