- From: Dan Burnett <dburnett@voxeo.com>
- Date: Thu, 12 May 2011 11:01:07 -0400
- To: public-xg-htmlspeech@w3.org
An updated html version of the minutes, with typos fixed, is available at
http://www.w3.org/2005/Incubator/htmlspeech/2011/05/05-htmlspeech-minutes.html

-- dan

On May 11, 2011, at 6:05 AM, Dan Burnett wrote:

> Group,
>
> The minutes are available at http://www.w3.org/2011/05/05-htmlspeech-minutes.html
>
> For convenience, a text version is below.
>
> Thanks to Charles Hemphill for taking minutes!
>
> -- dan
>
> Attendees
>
>    Present
>           Dan_Burnett, Michael_Bodell, Bjorn_Bringert, Robert_Brown,
>           Olli_Pettay, Charles_Hemphill, Patrick_Ehlen, Dan_Druta,
>           Michael_Johnston, Raj_Tumuluri
>
>    Regrets
>           Debbie_Dahl, Marc_Schroeder
>
>    Chair
>           Dan_Burnett
>
>    Scribe
>           Charles_Hemphill
>
> Contents
>
>    * [4]Topics
>        1. [5]F2F Logistics: Any updates on attendance, hotel bookings,
>           and questions or details from Bjorn.
>        2. [6]Review new text in the updated "Final Report" document to
>           ensure it matches what people think we agreed upon in our
>           last teleconference.
>        3. [7]Determine if we already have other agreed-upon design
>           decisions.
>        4. [8]Begin discussing issues listed in the Appendix.
>    * [9]Summary of Action Items
>    _________________________________________________________
>
> <burn> trackbot, start telcon
>
> <trackbot> Date: 05 May 2011
>
> <burn> Scribe: Charles_Hemphill
>
> <burn> ScribeNick: Charles
>
> <burn> Agenda: [10]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011May/0001.html
>
>   [10] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011May/0001.html
>
> F2F Logistics: Any updates on attendance, hotel bookings, and questions or details from Bjorn.
>
> Bjorn: no updates on F2F
>
> Burn: will send out schedule in the next few days.
>
> Review new text in the updated "Final Report" document to ensure it matches what people think we agreed upon in our last teleconference.
>
> <burn> document is [11]http://www.w3.org/2005/Incubator/htmlspeech/live/NOTE-htmlspeech-20110503.html
>
>   [11] http://www.w3.org/2005/Incubator/htmlspeech/live/NOTE-htmlspeech-20110503.html
>
> Burn: comments on the document - added general design decision - 17 new discussion bullets.
>
> Determine if we already have other agreed-upon design decisions.
>
> Bjorn: discussion topic about mic capture access. Propose design agreement - should be possible to start speech reco without selecting mic - just pick default.
>
> Burn: default vs. what you can do - two things.
>
> Bjorn: There should be a default mic. Perhaps the only option.
>
> Bjorn: saying explicit determination of mic should not be required.
>
> Bjorn: Should not need to enumerate mics before starting.
>
> Robert: Think we let you pick other mics.
> ... that's a reasonable interpretation.
> ... By default, mic provided by user agent default device.
>
> Bjorn: Need to discuss second sentence later - picking a mic.
> ... should be able to start reco without selecting mic - confirming agreement.
>
> Robert: Assuming that the default will be used for mic.
>
> Burn: notion of default mic.
>
> Robert: Issue of user interface. Shows speaker activity. Is there a default user interface? Can the application override?
>
> Bjorn: Have that requirement for default user interface.
>
> Robert: RE: default user interface - shows it's listening and lets user cancel.
>
> Olli: What is the default user interface? Something in the browser.
>
> Bjorn: Should only use browser user interface. No Web app user interface.
>
> Olli: More security or privacy concerns otherwise.
>
> DanD: worried about limitations of only in the browser.
>
> Robert: Don't think that's true.
> ... Default user interface. Can it be overridden? Where does it live? 3 discussions.
> ... Google puts it right in the Web page where the user clicks. Up to user agent to decide how to render.
>
> Bjorn: Have a default interface now.
>
> MichaelJ: Fine for default. Want APIs to allow someone to build their own. Different user experience. Allow this. Useful to have default. But not always appropriate.
>
> Bjorn: Have agreement on default. Have disagreement on your own due to security reasons, etc.
>
> MichaelJ: Very limiting otherwise.
>
> Bjorn: Should start speech by custom ways including JavaScript. Can hide that you're capturing audio if custom UI.
>
> Robert: Compromise - default UI parameterized? Provide feedback to the user. Style sheet. Look at customizations.
>
> MichaelB: Up to user agent to allow customization. Part of permissions API.
>
> Burn: Should be a default user interface.
> ... Should there be customization and at what level?
>
> DanD: Not all use cases in browsers. Different security concerns if rendering engine used. Should not be forced by HTML spec to have a particular UI.
> ... Don't want to prevent animated character app that is listening to you.
>
> Bjorn: Talk about browser case. Need to be clear that the browser is capturing the audio.
>
> DanD: Could be a matter of security settings.
>
> Bjorn: Don't say that we disallow customization, but don't require this.
>
> DanD: End up with fragmentation. Won't work cross browser.
>
> Bjorn: Allow for non-browser apps.
> ... Note for future discussion.
> ... Allow customization of the user interface that shows audio capture is happening.
>
> Burn: Have a discussion topic of the level of customization allowed.
>
> Bjorn: Should have customization for the UI for starting recognition. Have discussion topic: customize UI for showing that audio is being captured.
>
> MichaelJ: Waveform, traffic lights?
>
> Bjorn: Can app customize what the app looks like?
>
> MichaelJ: Can customize one that shows up in the UI.
> ... Multimodal tap and talk API. Want creativity. Activate recognition button. Don't want to rule out certain kinds of APIs. Don't want built-in browser feedback to interfere.
>
> Burn: come back to this discussion later.
>
> Begin discussing issues listed in the Appendix.
>
> Burn: Have time to discuss a serious topic. Can work out serious issues at FTF.
> ... Determine which topics have more meat. Start with audio.
> ... 3 audio related topics. How to get audio capture access. Mandatory audio codecs. Audio streaming support and how.
>
> Bjorn: 1st unrelated to 2nd two. 1st is API. 2nd two how audio is sent from browser to implementation.
>
> Burn: How to get audio mic capture access.
>
> Bjorn: MS proposal has mic selection. What are use cases?
>
> <burn> "audio mic capture" is "audio/mic/capture"
>
> Robert: Browser going to have mic API anyway. Avoid 2 mic APIs: 1 in speech and another unrelated (explicit). Want speech API to integrate with browser API.
> ... Many devices will have multiple mics. Important to select the one you want. Maybe app or user through preferences.
> ... May want to configure mic settings. Use for things other than speech. E.g. video app that does speech reco.
> ... MS API allows this. Can get audio stream to reco. Look at multimodal scenarios. Need for integrated API there. Speech API should integrate.
>
> Bjorn: Can buy most of that.
> ... If there is one there, should be able to use it for speech. But no such standard API yet.
>
> Robert: Pushing capture API heavily. With Michael. IE team thinks this is a sound approach.
>
> Burn: Agree ability to select diff. audio sources.
>
> Robert: Not quite it. If browser has mic API - we should be able to use it.
>
> Bjorn: Agree. But if not one, don't want to come up with one ourselves.
>
> Olli: agree.
>
> Bjorn: If HTML standard has one, we should be able to use it.
>
> Robert: Fine with HTML rather than browser.
>
> Burn: Meta decision. Use HTML if exists, but not create one.
>
> Robert: Have requirements for such an API?
> ... Latest draft doesn't have notion of stream of endpointing. And we care deeply about these for mic API.
>
> Bjorn: Why does mic API need endpointing?
>
> Robert: Can be a long way between mic and endpointer.
>
> <burn> should "stream of endpointing" be "stream or endpointing"?
>
> Bjorn: Requirement that endpointing be available for things other than speech.
>
> Michael: Hopefully, have agreement - will work with people designing the API and express requirements.
>
> Bjorn: Seems fair.
>
> Olli: Capture API in HTML draft or DAP working group draft?
>
> Robert: Mean the one in the DAP working group.
>
> Bjorn: Think we should work with HTML.
>
> Burn: 2nd one tricky. Wrote we will capture and express requirements on a capture API to relevant groups.
>
> Bjorn: Seems reasonable. Avoid "capture".
>
> Burn: requirements on audio capture APIs.
> ... requirements on all audio capture APIs.
>
> Bjorn: seems fine.
>
> <mbodell> Olli, is there a capture API in the w3c HTML draft? I don't see it at [12]http://dev.w3.org/html5/spec/Overview.html
>
>   [12] http://dev.w3.org/html5/spec/Overview.html
>
> <smaug> mbodell: I don't read that version of HTML spec ;)
>
> Bjorn: If no HTML audio capture API, propose that we proceed even without a mic API.
>
> <smaug> mbodell: [13]http://www.whatwg.org/specs/web-apps/current-work/multipage/dnd.html#video-conferencing-and-peer-to-peer-communication is an early draft
>
>   [13] http://www.whatwg.org/specs/web-apps/current-work/multipage/dnd.html#video-conferencing-and-peer-to-peer-communication
>
> <burn> (now Robert is speaking)
>
> Robert: Concern - browsers will need to implement privacy and security policies. Weird to have for speech alone, but not audio capture in general. May be messy.
>
> Bjorn: Forge ahead, and consider audio capture in general.
>
> Burn: Agreement that's important.
>
> Bjorn: Having control over audio capture does not have to be in the first proposal.
>
> Burn: Is that the consensus?
>
> Bjorn: OK to have speech API if there is not an audio capture API.
>
> Robert: Not create one, and shouldn't be blocked from moving forward.
>
> Burn: Not create one and not block while waiting for one.
>
> Michael: Design may be suboptimal if no audio capture API and may not fit well once it's there.
> ... Premature to jump to say we can make total progress without that.
>
> DanD: Goal for group to submit the requirements to the other working groups. Accelerating the capture API for audio may be one of the recommendations. AT&T member of DAP. Recognize needs.
>
> Bjorn: Agree we should not block this progress while waiting.
>
> DanD: May create fragmentation.
> ... Unless abstracted completely to "get mic".
>
> Bjorn: Agreed that we should start reco without specifying mic.
>
> DanD: Concerned that we should avoid fragmentation.
>
> Burn: Good to get agreement.
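
A rough sketch of what the agreement above could mean for a page author, using a purely hypothetical JavaScript constructor and property names (the group had not specified any API surface at this point), is:

    // Hypothetical API: start recognition with the user agent's default
    // microphone -- no device enumeration or explicit mic selection needed.
    var reco = new SpeechRecognition();      // assumed constructor name
    reco.onresult = function (event) {
      // shape of the result object is also assumed
      console.log("heard:", event.results);
    };
    reco.start();  // user agent shows its own "listening" indicator

The point of the sketch is only that starting recognition requires no mic handle; how the user agent surfaces its default capture UI is left entirely to the browser.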
>
> DanD: API for capture - if we are able to capture the audio without the web developer going through coding, then we are fine.
> ... If anything specific in the web application to retrieve the audio handle, then we're looking at if-then-else statements.
>
> Bjorn: We would like to do the former.
>
> Burn: What is meant by "start of speech", "end of speech", and endpointing in general? How do transmission delays affect the definitions and what we want in terms of APIs?
>
> Robert: Divide into smaller topics. Distributed env., with speech services remote. 2 notions of endpointing: by reco or cheap on client (responsiveness and reduced network IO). Look at these 2 as separate.
>
> Bjorn: Throw out proposal. Require client-side simple endpointer?
>
> Robert: Has my vote.
>
> Burn: No endpointer on my computer.
>
> Bjorn: Browser could do simple energy-based endpointing.
>
> Robert: Lots of options. GSM encoder has endpointer. Can have local reco and use for endpointer.
>
> Burn: API needs to assume client- as well as server-side endpointer. Client could be a null op?
>
> Bjorn: Stronger: has to be something in the client that does tell start and end of speech. Even if not good.
>
> Michael: Can see recommending. Don't know how web author can know. Requirement is low latency. Doesn't matter after that.
>
> Bjorn: Agree with that. But if app points to specific recognizer, can interact.
>
> Burn: Why concerned? Reco can get finicky about input based on training. Endpointing is mostly done in advance. Be careful about requiring local endpointing. If bad, can affect reco.
>
> Bjorn: Avoid bad endpointers.
> ... Low latency speech detection should always be available.
>
> MichaelJ: But not forced to use it. FedEx example: some query - using endpointing from reco - want them to be able to use the standard. Client endpointing could cause errors.
>
> Bjorn: Have some parameters. Make it easier for the app. Think you're speaking.
>
> Burn: Ongoing recognition case - won't use local endpointer.
> ... plenty of open mic apps - listen for keywords.
>
> Bjorn: Should be one, but should be possible for app to turn off.
>
> Robert: probably want app to turn it on if it needs it.
>
> Michael: Set a parameter and get it that way.
>
> Bjorn: Hello world app.
>
> Charles: Level for feedback - good to be local.
>
> Burn: Low latency endpoint detector should be available.
>
> Bjorn: Don't have agreement on whether it is on or off by default.
>
> MichaelJ: Talking about detection of end of speech or start too?
>
> Burn: may be big difference.
> ... Want low latency to turn on speech to reco - but don't want it to stop.
>
> Bjorn: we do the opposite.
> ... Start streaming right away, server endpoints, but need to stop streaming at some point.
>
> Robert: very scenario dependent. Need start/stop speech events. Start when click of button, end matters a lot. Need to have options available.
>
> Burn: Forwarding audio to expensive recognizers. Want high accuracy on endpointing. Don't want to send audio unless we have to due to expense.
>
> Bjorn: Cutting off audio vs. endpointer. Can not listen for the event. Control if endpointing cuts off audio.
>
> MichaelJ: Need to control when to start sending audio to recognizer.
>
> Burn: Start of speech and reco can be different.
>
> MichaelJ: If reco on for a long time, may want to do something to delay until there is certainty of speech.
>
> Bjorn: Agree that a low latency endpointer is available.
> ... Should be possible for app to decide if audio is started or stopped on endpointer.
>
> Burn: Audio start/stop separate from speech start/stop. Separately controllable.
> ... Detector detects both start/end of speech and fires an event in each case.
>
> Bjorn: Separate issue of cutting off audio.
>
> Burn: Audio to the reco process as opposed to TTS.
> ... Audio start and stop to reco server (resource)...
>
> Bjorn: Control over which audio is used for speech recognition.
> ... which part of the captured audio.
>
> DanD: Make sure we carefully agree that we are not forcing the application into using the predefined engine of the browser, and still allow the developer to choose which engine to use.
>
> DanD: have a flag: whether to use optimized endpointing in the application or not.
>
> Bjorn: Separate from how you choose the engine.
>
> MichaelJ: Related - if turned on, give some sort of event for local prediction of begin/end of speech; is that the resolution we want? If level detector, can also get level?
>
> Bjorn: Should be a more precise way to get actual events from recognizer. Level part of mic API?
>
> MichaelJ: Could be raw energy detector, limited reco listening for "silence", etc. for the local part. The browser, client side, can have the best that it can. Not saying anything about how it's done.
>
> Burn: May be a difference when there are multiple endpointers. (1) low latency - prefilter to decide if it goes to reco, (2) high quality in engine.
> ... Would want recognizer's endpoint detector. But preprocess one is the low latency one.
>
> Bjorn: 2 events: probable vs. actual start/end of speech.
>
> MichaelJ: Talking now vs. not. More going on underneath. Gets complicated to expose underneath if it varies by implementation. Energy level might drive aspects of the API.
>
> Burn: Why want distinction? Mic open is one option. Another is that engine is paying attention. Another is that engine found something important.
> ... Might decide that it's not hearing anything.
>
> Bjorn: Started capture, think starting, actually starting. 1st 2 go in the UI. Good to have last for timing.
>
> Burn: In VXML2, have hot word detection. Concluded it doesn't act as if speech is detected until something happens. Acts as if nothing happened if nothing reco'd. May collapse 2nd and 3rd states.
>
> Bjorn: Thought we had agreement earlier.
>
> Burn: Agreed we had some sort of start and end. Knew we needed to discuss it.
>
> Bjorn: 3.3.3 - onspeechstart/end/error. Need to add more to this list.
> ... propose adding onaudiostart and onaudioend, and splitting onspeechstart into detected vs. actual (reco).
>
> MichaelJ: energy vs. reco? split
>
> Bjorn: Could be confusing.
>
> MichaelB: onsoundstart?
>
> Bjorn: sounds like a good name.
>
> MichaelJ: Issues of calibration? Sensitivity parameters? Used on mobile phones or elsewhere. Might need calibration to work well.
>
> ???: Sensitivity and timeout parameters.
>
> Burn: Whole topic to discuss parameters.
>
> Bjorn: Discuss parameters in context.
> ... Agree on adding these events?
>
> Burn: We will add onaudiostart/end ... Dan will cut and paste here?
>
> Bjorn: onsoundstart/end should be low latency. Also say something about order.
>
> Burn: OK. audiostart, soundstart, speechstart, speechend, soundend, audioend
>
> Bjorn: Might not get soundstart or speechstart.
>
> Burn: onsoundstart, require soundend.
> ... soundend optional?
>
> Bjorn: not true.
> ... Can't have onspeechstart without the preceding two.
>
> Charles: Want end events with start events.
>
> Burn: Can have ends all at the same time.
>
> Bjorn: what if onerror?
>
> Burn: Great topic.
> ... capture that as issue for discussion.
> ... what happens to audio, sound, and speech events in case of error.
>
> Bjorn: Also sensitivity discussion point. And timeout parameters for ASR.
>
> Burn: Meeting next week. Can have call after that. Meeting after that. 2 days of meeting.
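
A minimal sketch of the event ordering and endpointer control discussed above, again with assumed names: onaudiostart/onsoundstart/onspeechstart and their end counterparts were only proposed on this call, and the stopOnSpeechEnd flag below is invented purely to illustrate letting the app decide whether the endpointer cuts off audio:

    var reco = new SpeechRecognition();   // assumed constructor name
    reco.stopOnSpeechEnd = false;         // hypothetical flag: keep streaming past detected end of speech

    // Order discussed in the call: audiostart, soundstart, speechstart,
    // speechend, soundend, audioend; soundstart/speechstart may never
    // fire if no sound or speech is detected.
    reco.onaudiostart  = function () { console.log("audio capture started"); };
    reco.onsoundstart  = function () { console.log("low-latency detector: some sound"); };
    reco.onspeechstart = function () { console.log("recognizer: speech started"); };
    reco.onspeechend   = function () { console.log("recognizer: speech ended"); };
    reco.onsoundend    = function () { console.log("sound ended"); };
    reco.onaudioend    = function () { console.log("audio capture ended"); };
    reco.onerror       = function (e) { console.log("error:", e); };  // interaction with the events above was left as an open issue

    reco.start();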
Received on Thursday, 12 May 2011 15:19:05 UTC