[minutes] 5 May 2011 from Dan Burnett on 2011-05-11 (public-xg-htmlspeech@w3.org from May 2011)

From: Dan Burnett <dburnett@voxeo.com>
Date: Wed, 11 May 2011 06:05:23 -0400
To: public-xg-htmlspeech@w3.org
Message-Id: <C2F8EA17-CDE3-4C33-ADE9-31B496AC4331@voxeo.com>

Group,

The minutes are available at http://www.w3.org/2011/05/05-htmlspeech-minutes.html

For convenience, a text version is below.

Thanks to Charles Hemphill for taking minutes!

-- dan

Attendees

Present
Dan_Burnett, Michael_Bodell, Bjorn_Bringert, Robert_Brown,
Olli_Pettay, Charles_Hemphill, Patrick_Ehlen, Dan_Druta,
Michael_Johnston, Raj_Tumuluri

Regrets
Debbie_Dahl, Marc_Schroeder

Chair
Dan_Burnett

Scribe
Charles_Hemphill

Contents

* [4]Topics
1. [5]F2F Logistics: Any updates on attendance, hotel
bookings, and questions or details from Bjorn.
2. [6]Review new text in updated "Final Report" document
to ensure it matches what people think we agreed
upon in our last teleconference.
3. [7]Determine if we already have other agreed-upon design
decisions.
4. [8]Begin discussing issues listed in the Appendix.
* [9]Summary of Action Items
_________________________________________________________

<burn> trackbot, start telcon

<trackbot> Date: 05 May 2011

<burn> Scribe: Charles_Hemphill

<burn> ScribeNick: Charles

<burn> Agenda:
[10]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011May
/0001.html

[10] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011May/0001.html

F2F Logistics: Any updates on attendance, hotel bookings, and questions
or details from Bjorn.

Bjorn: no updates on F2F

Burn: will send out schedule in the next few days.

Review new text in updated "Final Report" document to ensure it
matches what people think we agreed upon in our last teleconference.

<burn> document is
[11]http://www.w3.org/2005/Incubator/htmlspeech/live/NOTE-htmlspeech
-20110503.html

[11] http://www.w3.org/2005/Incubator/htmlspeech/live/NOTE-htmlspeech-20110503.html

Burn: comments on the document - added general design decisions - 17
new discussion bullets.

Determine if we already have other agreed-upon design decisions.

Bjorn: discussion topic about mic capture access. Propose design
agreement - should be possible to start speech reco without
selecting mic - just pick default.

Burn: default vs. what you can do - two things.

Bjorn: There should be a default mic. Perhaps the only option.

Burn: saying explicit determination of mic should not be required.

Bjorn: Should not need to enumerate mics before starting.

Robert: Think we let you pick other mics.
... that's a reasonable interpretation.
... By default, mic provided by user agent default device.

Bjorn: Need to discuss second sentence later - picking a mic.
... should be able to start reco without selecting mic - confirming
agreement.

Robert: Assuming that the default will be used for mic.

Burn: notion of default mic.

Robert: Issue of user interface. Shows speaker activity. Is there a
default user interface? Can the application override.

Bjorn: Have that requirement for default user interface.

Robert: RE: default user interface - shows it's listening and lets
user cancel.

Olli: What is the default user interface. Something in the browser.

Bjorn: Should only use browser user interface. No Web app user
interface.

Olli: More security or privacy concerns otherwise.

DanD: worried about limitations of only in the browser.

Robert: Don't think that's true.
... Default user interface. Can it be overridden. Where does it
live. 3 discussions.
... Google right in the Web page where the user clicks. Up to user
agent to decide how to render.

Bjorn: Have a default interface now.

MichaelJ: Fine for default. Want APIs to allow someone to build
their own. Different user experience. Allow this. Useful to have
default. But not always appropriate.

Bjorn: Have agreement on default. Have disagreement on your own due
to security reasons, etc.

MichaelJ: Very limiting otherwise.

Bjorn: Should start speech by custom ways including JavaScript. Can
hide that you're capturing audio if custom UI.

Robert: Compromise - default UI parameterized? Provide feedback to
the user. Style sheet. Look at customizations.

MichaelB: Up to user agent to allow customization. Part of
permissions API.

Burn: Should be a default user interface.
... Should there be customization and what level.

DanD: Not all use cases in browsers. Different security concerns if
rendering engine used. Should not be forced by HTML spec to have a
particular UI.
... Don't want to prevent animated character app that is listening
to you.

Bjorn: Talk about browser case. Need to be clear that the browser is
capturing the audio.

DanD: Could be a matter of security settings.

Bjorn: Don't say that we disallow customization, but don't require
this.

DanD: End up with fragmentation. Won't work cross browser.

Bjorn: Allow for non-browser apps.
... Note for future discussion.
... Allow customization of the user interface that shows audio
capture is happening.

Burn: Have a discussion topic of the level of customization allowed.

Bjorn: Should have customization for the UI for starting
recognition. Have discussion topic: customize UI for showing that
audio is being captured.

MichaelJ: Waveform, traffic lights?

Bjorn: Can app customize what the app looks like?

MichaelJ: Can customize one that shows up in the UI.
... Multimodal tap and talk API. Want creativity. Activate
recognition button. Don't want to rule out certain kinds of APIs.
Don't want built-in browser feedback to interfere.

Burn: come back to this discussion later.

Begin discussing issues listed in the Appendix.

Burn: Have time to discuss a serious topic. Can work out serious
issues at FTF.
... Determine which topics have more meat. Start with audio.
... 3 audio related topics. How to get audio capture access.
Mandatory audio codecs. Audio streaming support and how.

Bjorn: 1st unrelated to 2nd two. 1st is API. 2nd two how audio is
sent from browser to implementation.

Burn: How to get audio mic capture access.

Bjorn: MS proposal has mic selection. What are use cases?

<burn> "audio mic capture" is "audio/mic/capture"

Robert: Browser going to have mic API anyway. Avoid 2 mic APIs. 1 in
speech and another unrelated (explicit). Want speech API to integrate
with browser API.
... Many devices will have mult. mics. Important to select the one
you want. Maybe app or user through preferences.
... May want to configure mic settings. Use for things other than
speech. E.g. video app that does speech reco.
... MS API allows this. Can get audio stream to reco. Look at
multimodal scenarios. Need for integrated API there. Speech API
should integrate.

Bjorn: Can buy most of that.
... If there is one there, should be able to use for speech. But no
such standard API yet.

Robert: Pushing capture API heavily. With Michael. IE team thinks
this is a sound approach.

Burn: Agree ability to select diff. audio sources.

Robert: Not quite it. If browser has mic API - we should be able to
use it.

Bjorn: Agree. But if not one, don't want to come up with one
ourself.

Olli: agree.

Bjorn: If HTML standard has one, we should be able to use it.

Robert: Fine with HTML rather than browser.

Burn: Meta decision. Use HTML if exists, but not create one.

Robert: Have requirements for such an API?
... Latest draft doesn't have notion of stream of endpointing. And
we care deeply about these for mic API.

Bjorn: Why does mic API need endpointing?

Robert: Can be a long way between mic and endpointer.

<burn> should "stream of endpointing" be "stream or endpointing"?

Bjorn: Requirement that endpointing be available for things other
than speech.

Michael: Hopefully, have agreement - will work with people designing
the API and express requirements.

Bjorn: Seems fair.

Olli: Capture API in HTML draft or draft working group.

Robert: Mean the one in the DAP working group.

Bjorn: Think we should work with HTML.

Burn: 2nd one tricky. Wrote we will capture an express requirement
on a capture API to relevant groups.

Bjorn: Seems reasonable. Avoid "capture".

Burn: requirements on audio capture APIs.
... requirements on all audio capture APIs.

Bjorn: seems fine.

<mbodell> Olli, is there a capture API in the w3c HTML draft? I
don't see it at [12]http://dev.w3.org/html5/spec/Overview.html

[12] http://dev.w3.org/html5/spec/Overview.html

<smaug> mbodell: I don't read that version of HTML spec ;)

Bjorn: If no HTML audio capture API. Propose that we proceed even
without a mic API.

<smaug> mbodell:
[13]http://www.whatwg.org/specs/web-apps/current-work/multipage/dnd.
html#video-conferencing-and-peer-to-peer-communication is an early
draft

[13] http://www.whatwg.org/specs/web-apps/current-work/multipage/dnd.html#video-conferencing-and-peer-to-peer-communication

<burn> (now Robert is speaking)

Robert: Concern - browsers will need to implement privacy and
security policies. Weird to have for speech alone, but not audio
capture in general. May be messy.

Bjorn: Forge ahead, and consider audio capture in general.

Burn: Agreement that's important.

Bjorn: Having control over audio capture does not have to be in the
first proposal.

Burn: Is that the consensus?

Bjorn: OK to have speech API if there is not an audio capture API.

Robert: Not create one, and shouldn't be blocked from moving
forward.

Burn: Not create one and not block while waiting for one.

Michael: May design suboptimal if no audio capture API and may not
fit well once it's there.
... Premature to jump to say we can make total progress without
that.

DanD: Goal for group to submit the requirements to the other working
groups. Accelerating the capture API for audio may be one of the
recommendations. AT&T member of DAP. Recognize needs.

Bjorn: Agree we should not block this progress while waiting.

DanD: May create fragmentation.
... Unless abstracted completely to "get mic".

Bjorn: Agreed that we should start reco without specifying mic.

DanD: Concerned that we should avoid fragmentation.

Burn: Good to get agreement.

DanD: API for capture, if we are able to capture the audio without
web developer going through coding, then we are fine.
... If anything specific in the web application to retrieve the
audio handle, then we're looking for if-then-else statements.

Bjorn: We would like to do the former.

Burn: What is meant by "start of speech", "end of speech", and
endpointing in general? How do transmission delays affect the
definitions and what we want in terms of APIs?

Robert: Divide into smaller topics. Distributed env., with speech
services remote. 2 notions of endpointing: by reco or cheap on client
(responsiveness and reduced network IO). Look at these 2 as
separate.

Bjorn: Throw out proposal. Require client-side simple endpointer?

Robert: Has my vote.

Burn: No endpointer on my computer.

Bjorn: Browser could do simple energy-based end pointing.
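
A minimal sketch of the kind of simple energy-based endpointing Bjorn
mentions, offered only as an illustration. The function names, the
-30 dB threshold, and the hangover of 3 frames are assumptions made for
this example; none of them were proposed in the discussion.

```python
import math
from typing import Iterable, List, Tuple


def frame_energy_db(frame: List[float]) -> float:
    """Return the RMS energy of one frame in dB relative to full scale 1.0."""
    if not frame:
        return -120.0
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    # Clamp to avoid log10(0) on digital silence.
    return 20.0 * math.log10(max(rms, 1e-6))


def endpoint(frames: Iterable[List[float]],
             threshold_db: float = -30.0,
             hangover_frames: int = 3) -> List[Tuple[str, int]]:
    """Emit ("start", i) and ("end", i) events from frame energies.

    Speech starts at the first frame above threshold_db and ends once
    hangover_frames consecutive frames are at or below it. Both values
    are hypothetical defaults chosen for the example.
    """
    events = []
    in_speech = False
    quiet_run = 0
    for i, frame in enumerate(frames):
        loud = frame_energy_db(frame) > threshold_db
        if not in_speech and loud:
            in_speech = True
            quiet_run = 0
            events.append(("start", i))
        elif in_speech:
            quiet_run = 0 if loud else quiet_run + 1
            if quiet_run >= hangover_frames:
                in_speech = False
                events.append(("end", i))
    return events
```

A detector this crude reacts to any loud sound, not only speech, which
is the quality concern raised later about bad endpointers.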

Robert: Lots of options. GSM encoder has endpointer. Can have local
reco and use for endpointer.

Burn: API needs to assume client as well as server-side endpointer.
client could be null op?

Bjorn: Stronger: has to be something in the client that does tell
start and end of speech. even if not good.

Michael: Can see recommending. Don't know how web author can know.
requirement is low latency. doesn't matter after that.

Bjorn: Agree with that. But if app points to specific recognizer,
can interact.

Burn: Why concerned. Reco can get finicky about input based on
training. Endpointing is mostly done in advance. Be careful about
requiring local endpointing. If bad, can affect reco.

Bjorn: Avoid bad endpointers.
... Low latency speech detection should always be available.

MichaelJ: But not forced to use it. FedEx example: some query -
using endpointing from reco - want them to be able to use the
standard. Client endpointing could cause errors.

Bjorn: Have some parameters. Make it easier for the app. Think
you're speaking.

Burn: Ongoing recognition case - won't use local endpointer.
... plenty of open mic apps - listen for keywords.

Bjorn: Should be one, but should be possible for app to turn off.

Robert: probably want app to turn it on if it needs it.

Michael: Set a parameter and get it that way.

Bjorn: Hello world app.

Charles: Level for feedback - good to be local.

Burn: Low latency endpoint detector should be available.

Bjorn: Don't have agreement if on or off by default.

MichaelJ: Talking about detection of end of speech or start too?

Burn: may be big difference.
... Want low latency to turn on speech to reco - but don't want it
to stop.

Bjorn: we do the opposite.
... Start streaming right away, server endpoints, but need to stop
streaming at some point.

Robert: very scenario dependent. Need start stop speech event. Start
when click of button, end matters a lot. Need to have options
available.

Burn: Forwarding audio to expensive recognizers. Want high accuracy
on end pointing. Don't want to send audio unless we have to due to
expense.

Bjorn: Cutting off audio vs. endpointer. Can not listen for the
event. Control if endpointing cuts off audio.

MichaelJ: Need to control when start sending audio to recognizer.

Burn: Start speech and reco can be different.

MichaelJ: If reco on for a long time, may want to do something to
delay until there is certainty of speech.

Bjorn: Agree that a low latency endpointer is available.
Should be possible for app to decide if audio is started or stopped
on endpointer.

Burn: Audio start/stop separate from speech start/stop. Separately
controllable.
... Detector detects both start/end of speech and fires an event in
each case.
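
One way to picture the separation Burn describes, sketched under stated
assumptions: the detector always fires its start and end events, while
two flags decide whether those events also start or stop the audio sent
to the recognizer. The function name and the gate_start and gate_end
flags are hypothetical and not part of any proposal discussed.

```python
from typing import Callable, List, Sequence


def select_audio_for_reco(is_speech: Sequence[bool],
                          gate_start: bool,
                          gate_end: bool,
                          on_event: Callable[[str, int], None]) -> List[int]:
    """Return indices of captured frames forwarded to the recognizer.

    is_speech holds a (hypothetical) local detector's per-frame decision.
    speechstart and speechend events always fire; gate_start and gate_end
    only control whether they also start or stop the forwarded audio.
    """
    forwarded = []
    sending = not gate_start  # without a start gate, stream from frame 0
    stopped = False           # set once an end gate has cut the audio
    in_speech = False
    for i, speech in enumerate(is_speech):
        if speech and not in_speech:
            in_speech = True
            on_event("speechstart", i)
            if gate_start and not stopped:
                sending = True
        elif not speech and in_speech:
            in_speech = False
            on_event("speechend", i)
            if gate_end:
                sending = False
                stopped = True
        if sending:
            forwarded.append(i)
    return forwarded
```

With gate_start off and gate_end on, this matches the case of streaming
right away and stopping at some point; with gate_start on and gate_end
off, it matches the case of a low latency start that does not cut the
audio off.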

Bjorn: Separate issue of cutting off audio.

Burn: Audio to the reco process as opposed to TTS.
... Audio start and stop to reco server (resource)...

Bjorn: Control over which audio is used for speech recognition.
... which part of the captured audio.

DanD: Make sure we carefully agree that we are not forcing the
application into using the predefined environment engine of the
browser and still allow developer to choose which engine to use.

DanD: have a flag. If use optimized endpointing in application or
not.

Bjorn: Separate from how you choose the engine.

MichaelJ: Related - if turned on, give some sort of event for local
prediction of begin/end of speech, is that the resolution we want?
If level detector, can also get level?

Bjorn: Should be a more precise way to get actual events from
recognizer. Level part of mic API?

MichaelJ: Could be raw energy detector, limited reco listening for
"silence", etc. for the local part. The browser, client side, can
have best that it can. Not saying anything about how it's done.

Burn: May be a difference when there are multiple endpointers. (1)
low latency - prefilter to decide if goes to reco, (2) high quality
in engine.
... Would want recognizer's endpoint detector. But preprocess one is
the low latency one.

Bjorn: 2 events: probable vs. actual start/end of speech.

MichaelJ: Talking now vs. not. More going on underneath. Get
complicated to expose underneath if varies by implementation. Energy
level might drive aspects of the API.

Burn: Why want distinction? Mic open is one option. Another is that
engine is paying attention. Another is that engine found something
important.
... Might decide that it's not hearing anything.

Bjorn: Started capture, think starting, actually starting. 1st 2 go
in the UI. Good to have last for timing.

Burn: In VXML2, have hot word detection. Concluded it doesn't act as
if speech is detected until something happens. Acts as if nothing
happened if nothing reco'd. May collapse 2nd and 3rd states.

Bjorn: Thought we had agreement earlier.

Burn: Agreed we had some sort of start and end. Knew we needed to
discuss it.

Bjorn: 3.3.3. - onspeechstart/end/error. Need to add more to this
list.
... propose adding onaudiostart onaudioend, and split onspeechstart
to detected vs. actual (reco).

MichaelJ: energy vs. reco? split

Bjorn: Could be confusing.

MichaelB: onsoundstart?

Bjorn: sounds like a good name.

MichaelJ: Issues of calibration? Sensitivity parameters? Used on
mobile phones or elsewhere. Might need calibration to work well.

???: Sensitivity and timeout parameters.

Burn: Whole topic to discuss parameters.

Bjorn: Discuss parameters in context.
... Agree on adding these events?

Burn: We will add onaudiostart/end ... Dan will cut and paste here?

Bjorn: onsoundstart/end should be low latency. Also say something
about order.

Burn: OK. audiostart, soundstart, speechstart, speechend, soundend,
audioend

Bjorn: Might not get soundstart or speechstart.

Burn: onsoundstart, require soundend.
... soundend optional?

Bjorn: not true.
... Can't have onspeechstart without the preceding two.

Charles: Want end events with start events.

Burn: Can have ends all at the same time.
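
The ordering listed above (audiostart, soundstart, speechstart,
speechend, soundend, audioend) can be sketched as a nesting check. This
is an illustration only: the function name is hypothetical, and the rule
that every start event is eventually matched by its end event is an
assumption, since the discussion left that point open.

```python
from typing import Iterable

# Nesting order as listed in the discussion: audio encloses sound,
# which encloses speech.
LAYERS = ["audio", "sound", "speech"]


def is_valid_sequence(events: Iterable[str]) -> bool:
    """Check that start/end events nest as audio > sound > speech.

    Assumption (not settled in the discussion): every start event is
    eventually matched by its end event.
    """
    open_layers = []  # currently open layers, outermost first
    for event in events:
        for layer in LAYERS:
            if event == layer + "start":
                # A layer may only open directly inside its enclosing layer.
                if open_layers != LAYERS[:LAYERS.index(layer)]:
                    return False
                open_layers.append(layer)
                break
            if event == layer + "end":
                # Only the innermost open layer may close.
                if not open_layers or open_layers[-1] != layer:
                    return False
                open_layers.pop()
                break
        else:
            return False  # unknown event name
    return not open_layers
```

Under this check a session with only audiostart and audioend is valid,
which reflects the remark that soundstart or speechstart might never
arrive, while speechstart without a preceding soundstart is rejected.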

Bjorn: what if onerror?

Burn: Great topic.
... capture that as issue for discussion.
... what happens to audio, sound and speech events in case of error.

Bjorn: Also sensitivity discussion point. And timeout parameters for
ASR.

Burn: Meeting next week. Can have call after that. Meeting after
that. 2 days of meeting.

Received on Wednesday, 11 May 2011 10:05:52 UTC