- From: Dan Burnett <dburnett@voxeo.com>
- Date: Fri, 4 Nov 2011 14:13:28 -0400
- To: public-xg-htmlspeech@w3.org
Group,
The minutes from our first day are at http://www.w3.org/2011/11/03-htmlspeech-minutes.html.
For convenience, I have pasted a text version below.
-- dan
*****************************
HTML Speech Incubator Group Teleconference
03 Nov 2011
See also: [2]IRC log
[2] http://www.w3.org/2011/11/03-htmlspeech-irc
Attendees
Present
DanB, Michael, Glen, Matt, Robert, Patrick, Avery, Nagesh,
Debbie, Bertha, Milan, Rahul, DanD
Regrets
Chair
Daniel_Burnett, Michael_Bodell
Scribe
ddahl_, ddahl
Contents
* [3]Topics
1. [4]Review recently sent examples
2. [5]Robert's example
3. [6]speech-enabled email
4. [7]Milan's example of protocol
5. [8]michael johnston's multimodal use case
6. [9]Charles Hemphill's example
7. [10]Michael Bodell's example 8, translation
8. [11]Debbie's example
9. [12]another example from Charles Hemphill
10. [13]issues
11. [14]Protocol Issues
12. [15]Web API Issues
13. [16]Issue 6
14. [17]Issue 7
15. [18]Issue 8
16. [19]Issue 9
17. [20]Issue 10
18. [21]Issue 11
19. [22]Issue 12
20. [23]Issue 13
21. [24]Issue 14
22. [25]Issue 15
23. [26]Issue 16
24. [27]Issue 17
25. [28]Issue 18
26. [29]Issue 19
27. [30]Issue 20
28. [31]Issue 21
29. [32]Issue 22
30. [33]Issue 23
* [34]Summary of Action Items
_________________________________________________________
<smaug> hi
<smaug> well, who am I then o_O
<smaug> pong
<burn> trackbot, start telcon
<trackbot> Date: 03 November 2011
<Milan> ScribeNick: Milan
Review recently sent examples
<DanD>
[35]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Oct
/att-0064/speechwepapi_1_.html#introduction
[35] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Oct/att-0064/speechwepapi_1_.html#introduction
<mbodell> [36]http://bantha.org/~mbodell/speechxg/example1.html
[36] http://bantha.org/~mbodell/speechxg/example1.html
Michael: Speech Web Search Markup only
Robert: Found addGrammarFrom() is awkward
... really a hint
Glen: True that input has no grammar
Michael: It's a builtin grammar
Robert: What about deriveGrammarFrom
Glen: It's an append grammar
DanD: Option might be a better example
Michael: Text is a grammar
Robert: Assume q is an object from which a grammar can be derived
<smaug> Nit, <button name="mic" onclick="speechClick()"> is a submit
button, so when you click it, the form is submitted. type="button"
would fix the problem
DanB: addDerivedGrammar
Debbie: Figure out semantics first
Robert: AddDerivedGrammarFromID
Glen: Also rename q to 'inputField'
... Also from text input type to date or something more constrained
... Need to specify the lack of grammars
... Is this dictation?
Robert: improve example by defaulting to UTF-8
<glen> Section 5.1: when no grammar specified, defaults to
builtin:dictation
Robert: Base 64 encoding is ugly
... to the point where it is unusable
Michael: Worried about directly inserting XML due to 8th bit
DanB: Are there already common protocols for inserting strings
derived from URLs into local variables?
Glen: Should only be a W3C standard, implementation is orthogonal
Robert: AddFromString() would be nice
Glen: addStringGrammar() and addElementGrammar()
Avery: Prefer longer name because it's truer to form
<smaug> Couldn't you just prepend "data:application/srgs+xml," to
the serialized XML. But anyway, using data urls is kind of hackish,
IMO.
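A minimal sketch of smaug's suggestion (the helper name is hypothetical): wrap serialized SRGS XML in a data: URI with percent-encoding rather than base64, so the grammar stays human-readable in page source.

```javascript
// Hypothetical helper, per smaug's comment: percent-encode the
// serialized grammar instead of base64-encoding it.
function grammarToDataUri(srgsXml) {
  return "data:application/srgs+xml," + encodeURIComponent(srgsXml);
}

const srgs = '<grammar version="1.0"><rule id="yes"><item>yes</item></rule></grammar>';
const uri = grammarToDataUri(srgs);
// decodeURIComponent on the payload round-trips back to the XML
```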
Robert: Too many dots to get the interpretation
Milan: Propose addGrammarFromURI()
Robert: Newing up a speech grammar is better approach
Michael: Let's just raise issues now rather than solve them
Debbie: Example is complex, and gets mixed up with the argument that
JS is complex
* laptop?
Michael: Next example from Bjorn
Robert: The example lacks a grammar
<smaug> s/onclick="startSpeech"/onclick="startSpeech(event)"/
<DanD>
[37]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov
/att-0008/web-speech-sample-code.html
[37] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/att-0008/web-speech-sample-code.html
Robert: Need to define what happens when lacking a grammar
Avery: Is there a policy against comments in the examples?
Michael: Planning on adding examples to an appendix
Avery: It's a decent example, as long as it is clear that this
instance lacks a grammar
Robert: Example shows default behavior
Rahul: Could also delete button as means of shortening the example
<glen> per Avery's suggestion: add a comment "since no grammar is
specified and no element is bound, uses default grammar
builtin:dictation"
Rahul: Two different ways to perform same array access
Glen: Should make it consistent in example
<mbodell> In Bjorn's second example need sir.maxNBest = 2;
<glen> use same notation: s/q.value =
event.result.item(0).interpretation;/q.value =
event.result[0].interpretation;/
Robert: Intent is to get a text transcript of the user's input
... why are we accessing the interpretation instead of tokens?
Milan: Need to bring this up in protocol team
<all agreed> to use "utterance" in place of interpretation
Milan: Last two comments should apply here as well
... Should we have company-specific references?
Michael: Prefer example.org
Robert: Is there speech recognition in turn by turn?
Michael: Speech recognition is just destination capture
<smaug> Again, s/onclick="startSpeech"/onclick="startSpeech(event)"/
Robert: Prefer that speaking the next instruction cancel the last
instruction
Glen: Thought the purpose of example was to show interplay between
speech and tts?
Michael: TTS play resumes where last left off
Glen: Way to stop prior play is a good feature
... we should change this example
<glen> change example to show how to stop, by persisting the tts
object and calling stop before adding .text and .play
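Glen's suggested change could be sketched like this (the TTS object's API shape is assumed, not taken from the draft): keep one TTS object around and stop any prior playback before queuing new text, instead of letting utterances overlap.

```javascript
// Sketch only: persist the tts object, call stop() before setting
// .text and calling .play(), as glen suggests. TTSClass is a stand-in
// for whatever constructor the eventual API defines.
let currentTts = null;
function speak(text, TTSClass) {
  if (currentTts) currentTts.stop(); // cancel whatever is still playing
  currentTts = new TTSClass();
  currentTts.text = text;
  currentTts.play();
}
```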
Michael: Ollie example next
<mbodell>
[38]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov
/att-0009/htmlspeech_permission_example.html
[38] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/att-0009/htmlspeech_permission_example.html
Michael: First example is just removing unauthorized elements?
... but second example doesn't allow speech input to start
Ollie: Yes
Michael: Can you transition from not authorized to authorized?
Ollie: Should be possible, but example doesn't do that
... but could also just reload the page
* Going on break now
<inserted> scribe:ddahl_
<scribe> scribe:ddahl
Robert's example
robert: two recognitions in a row, you want to pick your cities
based on what state you're in.
<Avery> Actually I think it's based on what state is specified in
the first reco, not necessarily what state you're in. A minor nit.
robert: it really should say "interpretation.state", not just
"interpretation"
... used push instead of adding things to the array of speech
grammars
... a bug on result, should be city, also, sr.onMatch should be
sr.onResult
... second example is rereco
... gives grammars to speechInputRequest, then classifies, then does
rereco with a specific grammar
glenn: this seems to be a strange use of "interpretation"
robert: there is a huge universe of grammars
rahul: this is identifying one grammar as different from the others
robert: using the attribute "modal" to activate and deactivate
grammars
... would change the example to get interpretation.classification
... strange to have multiple "modals" as true, think modal might be
a bad idea
speech-enabled email
michael: one interesting thing is that you might get notifications
that you would want to speak to, but without clicking
robert: was mostly thinking about things like "reply", but you could
also imagine saying "read it to me" after notification
... made up a method to cancel TTS
michael: you could just delete the element
robert: what if you set up the element with stuff in it?
glenn: destroy should not be the only way to cancel
Milan's example of protocol
milan: will augment with API calls that trigger protocols
... need a result index of some kind
... then recognizer decides to change its mind and reorders results
... strange to get a "complete" result in the middle of a long
dictation
... result index 0 is the first fragment, then halfway through the
second fragment, the recognizer says the first one is done
... different from MRCP, because in MRCP that means it's the end of
it
... then retracts a result, not sure how to represent this, maybe an
"IN_PROGRESS" message with no payload
... we will put this in the larger document as an example of the
protocol
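A rough sketch of the exchange Milan describes, using the header names from the discussion; the frame syntax here is illustrative only, not an agreed format:

```
S->C: INTERMEDIATE-RESULT <request-id> IN-PROGRESS
      Result-Index: 0
      <EMMA payload: candidate for fragment 0>

S->C: INTERMEDIATE-RESULT <request-id> IN-PROGRESS
      Result-Index: 0
      Result-Status: final        ; recognizer commits fragment 0
      <EMMA payload: final fragment 0>

S->C: INTERMEDIATE-RESULT <request-id> IN-PROGRESS
      Result-Index: 1             ; no payload, no Content-Type:
                                  ; nulls out the candidate for fragment 1
```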
michael johnston's multimodal use case
<smaug> Could you please paste links to the example here
michael: "I want to go from here to there" is the use case
<smaug> ( would be then easier to read minutes later )
<mbodell> Michael's example:
[39]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov
/att-0020/multimodal_example.html
[39] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/att-0020/multimodal_example.html
<mbodell> You can walk through the examples from:
[40]http://bantha.org/~mbodell/speechxg/f2f.html which links to
[41]http://bantha.org/~mbodell/speechxg/examples.html which then
walks through the examples
[40] http://bantha.org/~mbodell/speechxg/f2f.html
[41] http://bantha.org/~mbodell/speechxg/examples.html
glenn: it would be good to have a "state" attribute
... the "nomatch" state is more of a result, not a state
... we may need more than one attribute to get results of speech
processing
michael: this also has the EMMA so that you can see the mapping from
EMMA
... this example makes use of a remote speech service
<glen>
[42]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov
/att-0020/multimodal_example.html
[42] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/att-0020/multimodal_example.html
michael: the EMMA shows the combined speech and gui input
robert: this should be a wss: , that is, a web socket protocol, but
what should we do if someone uses http?
michael: you could get the command right but not the person if you
didn't do the "clickInfo"
Charles Hemphill's example
<glen>
[43]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov
/0024.html
[43] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/0024.html
danD: we should start with the simplest example
Michael Bodell's example 8, translation
<glen> [44]http://bantha.org/~mbodell/translate.html
[44] http://bantha.org/~mbodell/translate.html
<glen> view-source:[45]http://bantha.org/~mbodell/translate.html
[45] http://bantha.org/~mbodell/translate.html
michael: different example of translation
... there's from and to languages, you choose, and then click on
microphone to talk
... there's a progress bar that gets updated
... we're grabbing our language from the selector, we're using a
dictation grammar for whatever language we're using
... where are we doing capture?
glen: wouldn't that be the microphone?
michael: not necessarily, there could be other things like media
streams
glen: is capture necessary or does it just provide more features?
michael: we didn't have any examples of capture from other places,
like from Web RTC
... right now there's no standard for accessing microphone
glen: would like to see default example where we don't have to
explicitly do capture
michael: all examples assume that there's magic for capturing audio
glen: can't we make it so that the magic is what happens by default?
dan: there are many security and privacy issues
... different permissions for getting access to media but also to do
something to the media
michael: this is also raised in some of our issues, we only have a
two sentence note now
... can TTS work on Web Sockets?
robert: yes
michael: on audio start, etc. are in our spec. another issue is that
payload of start, stop events isn't defined
robert: do we have VU meter events?
michael: no
dan: that came up in Web RTC, they don't have that, but they could
create it
michael: we do have speech-x events for custom extensions
robert: most speech apps have one
michael: is that part of the UA or the app?
Debbie's example
multi-slot filling
<mbodell> Debbie's:
[46]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov
/att-0031/Multi-slotSpeech1.html
[46] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/att-0031/Multi-slotSpeech1.html
debbie: in this example you have to pull out the slot values from
the EMMA
robert: is this the same as saying "interpretation.booking"?
debbie: not sure
... we don't know what's in "interpretation"
robert: we could get rid of "interpretation"
michael: it could be a useful pointer into the EMMA
... that is available in VXML
<mbodell> Issue: we should make sure it is clear what the
interpretation points to
<trackbot> Created ISSUE-1 - We should make sure it is clear what
the interpretation points to ; please complete additional details at
[47]http://www.w3.org/2005/Incubator/htmlspeech/track/issues/1/edit
.
[47] http://www.w3.org/2005/Incubator/htmlspeech/track/issues/1/edit
michael: should do an if to make sure that you really got a value
debbie: could add the EMMA
... would there be value in some kind of convenience syntax so that
you don't need the full DOM generality to manipulate the EMMA
result?
<mbodell> Charles' example:
[48]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov
/0033.html
[48] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/0033.html
another example from Charles Hemphill
michael: the same example as before but with an external grammar
avery: what's the advantage of having "reco" element as a child
under "input"
michael: there are two different ways to do the same thing, with
"reco" under as a "child" under <input> you don't need an id
<smaug> <input> element can't have child elements
actually, input is a child of reco in the proposal
<smaug> My comments to example 3
[49]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov
/0034.html
[49] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/0034.html
michael: another example with a real inline grammar so that you
don't have to do data uri
... we would have to define a "grammar" tag
robert: we would have to define for browsers how to interpret SRGS
avery: like putting script in page vs an external reference
<smaug> Milan: remember, we're talking about HTML here, not XML
<smaug> (I assume that was Milan)
milan: could we say "as long as this is valid XML ignore it and pass
it to us"?
robert: why wrap the whole thing with the grammar element?
michael: if there's an SRGS 1.1, you wouldn't know what version it
was, for example
... would like to have inline grammar, if any, be full SRGS with
<grammar> element
... that is the end of the examples
<Milan> * Good point Ollie
<glen> scribenick: glen
issues
burnett: if can't agree, depends on importance. If important,
capture different opinions in doc.
... (not required to resolve everything in incubator group)
<mbodell> First issue to discuss:
[50]http://bantha.org/~mbodell/speechxg/issuep1.html
[50] http://bantha.org/~mbodell/speechxg/issuep1.html
1. What Content-Type do we want to use on an empty message? Use case
was nulling out previous candidate recognition.
milan: do we have to specify? can it be assumed?
... empty means no payload?
robert: protocol doesn't require a body
... in which case I don't think it needs a content type. Example
getParams
michael: what about interim results?
... if no content and not content type then nulls out corresponding
result. Example: an interim result gets replaced with no result
(e.g. if a <cough> is initially recognized as some text)
Protocol Issues
2. I am skeptical about changing established MRCP event/method
names. I sort of agree that LISTEN is better than RECOGNIZE, but do
not think the reasons are good enough to warrant ensuing churn.
Robert: Microsoft doesn't care if similar to MRCP, rather that it's
compatible with our web sockets protocol
burnett: web sockets is just a transport
... violates many types of protocol design
... if standards track, IETF is a logical place
robert: so naming doesn't matter much at this point.
all: agree
burnett: some talk of using SIP to setup, would have to separate
signaling and data...which is one thing wrong with this.
robert: this is more to illustrate a point that it can be done
burnett: companies could implement today, and may not be completely
interoperable (as is often the case on first implementations)
michael: we agree, not to change names right now. Names will likely
to be re-evaluated in a standards track.
... minor syntax issues can be called out as a note in the doc.
burnett: when gets into a standards group, they look at requirements
and take ideas into consideration, but they consider MANY other
factors, e.g. security, that drive
3. We need a way to index the recognition results. I suggest using a
Result-Index header
all: agree to add. if a one-shot recognition, it's only [0] and
still optional
4. It was awkward to use a RECOGNITION-COMPLETE message presumably
with a COMPLETE status during continuous speech. Instead, I used
INTERMEDIATE-RESULT with a new Result-Status header set to final.
robert: just rename RECOGNITION-COMPLETE as RECOGNITION-RESULT
... it's an intermediate, unless it's a final response type.
burnett: MRCP has separate status code and completion code
Milan: we need a complete flag, not sure it was defined. We haven't
stated which status codes correspond to which messages.
burnett: in MRCP, status is about communication (like 200 OK). In
MRCP, the completion code indicates what happened (e.g. successful
reco)
robert: so status indicates "sending more", so status should be
in-progress for continuous reco case.
... need request state?
burnett: request has been made, has it been completed yet? status is
success, illegal method, illegal value, unsupported header
robert: reco result, 200 OK, in progress
5. Perhaps Source-Time should also be required on final results
all: yes, everything's fine, more to come
Milan: by time have final result, should know start time.
all: agree, require only reco result
Milan: could be reco result with type = pending
michael: pending implies have already started
robert: in progress more accurate
all: agree to leave as is
6. Wanted to confirm that channel identification is being handled by
the WebSocket container
robert: handled by web socket
... if two separate recos, then two web sockets and two audio
streams. (Can have 2 grammars active in one reco)
milan: continuous hotword case
robert: that's continuous reco
... start session with hotword and command-control grammar, all is
continuous results
michael: hard if change over time
... because have to pause to change
... so not continuous
robert: don't want to transmit audio twice, but with two sessions,
you must
avery: does the emma result specify which grammar?
michael: yes
7. I noticed that Completion-Cause was missing from Robert's spec
example in section 4.2.
robert: accidental omission, need to add
Web API Issues
1. To get the reco result I think i have to write
"e.result.item(0).interpretation". This is a lot of dots and an
index just to get the top result.
robert: I want to write e.interpretation -- because most of the time
that's what I want (but still could use the verbose way as well)
<mbodell> Here is the link to where the event is defined:
[51]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Oct
/att-0064/speechwepapi_1_.html#speechinputresult
[51] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Oct/att-0064/speechwepapi_1_.html#speechinputresult
milan: e.result.interpretation
michael: can already use e.result[0].interpretation
glen: we should change utterance to match
all: e.interpretation and e.utterance
... agreed
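The resolved shorthand could look like this in practice (object shapes assumed for illustration): e.interpretation and e.utterance are conveniences for the top result, while the verbose path through e.result remains available.

```javascript
// Mock event object, shapes hypothetical: the shorthand accessors
// simply forward to result[0], mirroring the agreed convenience.
const event = {
  result: [{ interpretation: { city: "Boston" }, utterance: "Boston" }],
  get interpretation() { return this.result[0].interpretation; },
  get utterance() { return this.result[0].utterance; }
};
```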
2. "utterance" has a couple of different meanings in the doc. It's
alternatively the recording of what the person said, or the
transcript returned by the recognizer.
michael: transcript? text? tokens?
text and token are overused and confusing
robert: but it is text, so not overloading the concept
... (unlike token)
burnett: transcript, closest to what's actually happening, laymen
get it
glen: text is not descriptive: interpretation is text, whereas
transcript vs interpretation is clear
all: agree: rename utterance to transcript
5. The "modal" attribute on SpeechGrammar is unnecessarily
restrictive
Discussion: There are cases where I'll want to have multiple
grammars active, but not all, and not just one. Developers would be
better off with a boolean enabled attribute on each grammar. Would
be useful to clarify the behavior when there is more than 1 grammar
with this set to true (only the first in the list is active?) Is
this even useful at all? What is the case for having grammars which
aren't active in the reco? Can we change the state of the modal/u
robert: less lines of code if just set one to true
milan: alternatively, could add/remove from grammars array
glen: sending all at once allows caching
... of grammars
... what about continuous case, can grammars change on the fly
michael: we decided to simplify by re-calling .start to change
grammars or anything else
milan: should have a separate way to preload
burnett: voicexml has defineGrammar
milan: grammar set object on the SpeechInputRequest
... I'm proposing sets of grammars
robert: I'd like it flatter, get rid of enabled/disabled -- just
delete -- and don't allow preload
michael: already have .open that allows preloading
<mbodell> See web api:
[52]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Oct
/att-0064/speechwepapi_1_.html#dfn-open
[52] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Oct/att-0064/speechwepapi_1_.html#dfn-open
<scribe> scribenick: glen
burnett: nervous about this, we discussed this for a long time and
considered many edge-cases
robert: alternative: get rid of modal and enable, and just use a
bunch of grammars
avery: if .open has already been called, .start doesn't call it
(.start only calls .open if hasn't been opened yet)
burnett: wondering if there are performance advantages in reco
engine if can call enable/disable as opposed to calling .open
multiple times?
<smaug> start might need to re-call open if authorizationState has
changed from not-authorized to authorized
milan: MRCP didn't solve this, why should we?
... all good MRCP clients do what you're saying automatically, they
automatically check for the deltas
robert: this runs at web-scale, distributed
... big difference between telephony and web
michael: options: eliminate modal, keep and define what happens if
multiple set to true
avery: easier to add later than remove
agree: eliminate .modal
5. The interpretation attribute is completely opaque. That may be
necessary given that SISR can pretty much return anything. But it'll
need some examples to show how to use it.
burnett: there was support for a flat array of interpretations
... I didn't like that, Nuance and their customers didn't like it,
debbie: use emma to define layout
michael: different reco engines may use emma in different ways
... fundamentally, .interpretation points to somewhere in emma,
which simplifies (and a corresponding .transcript)
all: agree, specify which part of emma holds the interpretation
michael: mapped to a DOM object, emma literal or node
... like debbie's slot-filling example
... I will send text for this
5. The array of SpeechGrammar objects is too cumbersome
<smaug> something happened to the audio. it is all just noise
<smaug> though, getting late here
Discussion: Robert: The array of SpeechGrammar objects is too
cumbersome. In most cases I'd like to write something simple like:
mySR.speechGrammars.push("USstates.grxml","majorUScities.grxml","majorInternationalCities.grxml");
But I can't. I have to new-up a separate
object for each one then add it to the array, even when I don't care
about the other attributes. Better to just make it an array of URI
strings, and add functions for the edge cases. e.g. void ena
void setWeight(in DOMString grammarUri, in float weight); And yeah,
I remember arguing the opposite on the phone call. But that's before
I tried writing sample code. Glen: "The uri of a grammar associated
with this reco. If unset, this defaults to the default builtin uri."
Presumably using the grammar attribute overwrites the default
grammar, so if a developer wishes to add a grammar that supplements
the default grammar, then this alternative should work: re
would add clarity. Michael: If you view source on the web api
document you'll see the grammar functions and descriptions are there
commented out as I anticipated, and agree, with this comment. We
should have both functions and array/collections and this makes the
things that Robert and Glen describe much easier/better.
michael: grammar spec after ? are hints, before builtin: are
required and errors if not supported
... example: builtin:contacts may recognize names in smartphone
... require builtin:generic
burnett: builtin:generic means I'll take anything you got: if it's
just a date grammar, I'll take it.
<mbodell> We are talking about
[53]http://bantha.org/~mbodell/speechxg/issuew5.html but really more
about what happens with no grammar
[53] http://bantha.org/~mbodell/speechxg/issuew5.html
milan: builtin:generic could respond with failure, builtin:dictation
could also respond with failure
robert: builtin:generic should be builtin:default
... and none specified is builtin:default
burnett: what if want to use both default and another grammar
glen: then add builtin:default and builtin:foo
michael: default is not user default, but service or ua default
milan: want a way to record without a grammar
michael: we define builtin:default, encourage vendors to implement,
and state when none specified, it's on by default (and when other
grammars specified, it can also be added).
... I like .addGrammar(url, weight) as a simplification from
creating object and then setting it
robert: .addGrammarFromUrl(url, weight)
... .addGrammarFromElement(element, weight)
.addGrammarFromString(string, weight)
... better yet: .addUrlGrammar .addElementGrammar .addStringGrammar
... but advantage for objects to be alphabetical order, grouped
together in docs
glen: .addGrammarUrl .addGrammarElement .addGrammarString
... remove is a JavaScript array operation
michael: also .addCustomParameter(name, value)
all: agree: .addGrammarUrl .addGrammarElement .addGrammarString
.addCustomParameter
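The agreed helper names could be stubbed roughly as follows; the host object's shape is hypothetical, and addGrammarElement is omitted here since it would need a DOM node.

```javascript
// Stub of the agreed helpers, shapes assumed for illustration only.
class RecoStub {
  constructor() {
    this.grammars = [];
    this.parameters = {};
  }
  addGrammarUrl(url, weight) {
    this.grammars.push({ src: url, weight: weight });
  }
  addGrammarString(srgsXml, weight) {
    // inline SRGS travels as a data: URI, per the earlier discussion
    this.addGrammarUrl("data:application/srgs+xml," +
                       encodeURIComponent(srgsXml), weight);
  }
  addCustomParameter(name, value) {
    this.parameters[name] = value;
  }
}

const sr = new RecoStub();
sr.addGrammarUrl("USstates.grxml", 1.0);
sr.addGrammarString("<grammar/>", 0.5);
sr.addCustomParameter("x-vendor-mode", "fast"); // parameter name made up
```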
<smaug> I think this is enough for me. I'll read the minutes
tomorrow and send comments
<smaug> It is midnight here
<smaug> dark? it has been dark here for the last 6 hours
<rahul> scribenick: rahul
Issue 6
<mbodell> Link to the current issue:
[54]http://bantha.org/~mbodell/speechxg/issuew6.html
[54] http://bantha.org/~mbodell/speechxg/issuew6.html
<glen> 6. The names are a bit long.
<glen> Discussion: e.g. "new SpeechInputRequest()" vs "new
SpeechIn()" . e.g. "mySR.speechGrammars.push("foo")" vs
"mySR.grammars.push("foo")" . e.g. "resultEMMAXML" vs "EMMAXML" or
just "EMMA" (call the other one "EMMAText" ) e.g. "inputWaveformURI"
vs "inputURI"
Milan: how about SpeechRequest instead of SpeechInputRequest?
Robert: SpeechRecognizer?
Milan: AudioSynthesizer?
Glen: SpeechReco?
<Milan> Milan: AudioSynth
Resolution: We will use SpeechReco instead of SpeechInputRequest
<matt> [55]Parkinson's Law of Triviality
[55] http://en.wikipedia.org/wiki/Bikeshedding
<scribe> ACTION: Editing team to update to SpeechReco [recorded in
[56]http://www.w3.org/2011/11/03-htmlspeech-minutes.html#action01]
<trackbot> Sorry, couldn't find user - Editing
Issue 7
7. SpeechInputRequest.outputToElement() should be an attribute,
perhaps 'forElement'
<matt> [57]Issue 7
[57] http://bantha.org/~mbodell/speechxg/issuew7.html
Resolution: Replace outputToElement() function with the
outputElement attribute
Issue 8
<inserted> [58]Issue 8
[58] http://bantha.org/~mbodell/speechxg/issuew8.html
8. SpeechInputResult has a getter "item(index)".
SpeechInputResultEvent has an array "SpeechInputResult[] results".
Discussion: Can we change both to be collections similar to
[59]http://www.w3.org/TR/FileAPI/#dfn-filelist (accessible via []
operator and optionally with a .item() method)?
[59] http://www.w3.org/TR/FileAPI/#dfn-filelist
Resolution: Accepted
Issue 9
<matt> [60]Issue 9
[60] http://bantha.org/~mbodell/speechxg/issuew9.html
9. The <reco> element should probably be a void element with no
content on its own
Discussion: Satish:
[61]http://dev.w3.org/html5/spec/Overview.html#void-elements. I just
noticed this in the for attribute's description, missed it in
earlier reads: "If the for attribute is not specified, but the reco
element has a recoable element descendant, then the first such
descendant in tree order is the reco element's reco control." Is
there a benefit to doing this over requiring the 'for' attribute to
be set and making reco a void element? Charles: I a
[61] http://dev.w3.org/html5/spec/Overview.html#void-elements.
<glen> resolution: can specify with either descendant or with for=
attribute
Resolution: Agreed to leave it as-is using either the for or the
descendant pattern
Issue 10
<inserted> [62]Issue 10
[62] http://bantha.org/~mbodell/speechxg/issuew10.html
10. TTS is hard
Discussion: Bjorn: I can't see any easy way to do programmatic TTS.
The TTS element is at least missing the attributes @text and @lang.
Without those, it's pretty hard to do the very simple use case of
generating a string and speaking it. It's possible, but you need to
build a whole SSML document. For use cases, see the samples I sent
earlier today. Dominic: For TTS, I don't understand where the
content to be spoken is supposed to go if it's not specified in
Michael: @lang is not missing since it could be inherited
... there is no @text there
Glen: content within <tts></tts> will show up within older browsers
<mbodell> Discussion is <tts src="data:text/plain,Hello, world"/>
versus <tts value="Hello, world"/> versus something else. Note in JS
we could define a function so it is pretty similar, but from Markup
a little harder to get the function creating the data uri (probably
still possible)
<glen> 72<tts value="fahrenheit">F</tts>
<glen> michael: tts as a markup may render visually a control (play,
stop, etc)
<glen> ...other dom can interact
<glen> glen: most uses of tts need dynamic control -- that is
require javascript
<glen> michael: because tts inherits from media-element, it requires
a src attribute
<glen> glen: <img alt="text">
<glen> michael: <tts> is not used as an alternative fallback
Dan: usecase for <tts> element is to facilitate easy generation as
part of markup rather than generating script
Michael: the @lang inherited from the <media> element should be
passed as a parameter to the synthesizer
Resolution: Add a @text attribute to <tts>.
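The workaround being weighed against @text can be seen by building the src URI by hand; the helper name below is hypothetical.

```javascript
// Building the data: URI that <tts src="..."> would need in order to
// speak a generated string without a @text attribute.
function ttsSrcFor(text) {
  return "data:text/plain," + encodeURIComponent(text);
}

const src = ttsSrcFor("Hello, world");
// → "data:text/plain,Hello%2C%20world"
```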
Issue 11
-> [65]http://bantha.org/~mbodell/speechxg/issuew11.html Issue 11
[65] http://bantha.org/~mbodell/speechxg/issuew11.html
11. How does binding to button work
Discussion: Satish: "When the recoable element is a button then if
the button is not disabled, then the result of a speech recognition
is to activate the button." "For button controls (submit, image,
reset, button) the act of recognition just activates the input."
"For type checkbox, the input should be set to a checkedness of
true. For type radiobutton, the input should be set to a checkedness
of true, and all other inputs in the radio button group must b
Michael: propose to have an issue note that this needs further
thought
Robert: define what we can, and for others say there is no binding
Resolution: Add issue note that more work to be done on bindings
Issue 12
12. What about meter, progress, and output elements?
Discussion: Satish: The meter, progress and output elements all seem
to be aimed at displaying results and not for taking user input. Is
there a reason why these are included as recoable elements?
Michael: This is specified at Reco Bindings. A person could want to
be able to speak and have it change a progress bar or meter or
output element. The primary reason is matching what is done with
label. These are all labelable elements and thus ended up as recoable
Glen: suggest we not talk about bindings to these
Dan: we need to decide which ones to leave out, I agree since these
are not even <input> elements
Resolution: Remove these from the recoable elements and bindings
Issue 13
13. grammars and parameters should be collections
Discussion: Satish: Similar to issue 8, SpeechInputRequest
attributes 'grammars' and 'parameters' should probably be turned
into a collection as well
Resolution: Accepted
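A minimal sketch of what the accepted collection shape could look like (the names SpeechGrammarList, addFromUri, and item are assumptions for illustration, not agreed spec text):

```javascript
// Hypothetical collection type for the 'grammars' attribute, replacing a
// single-valued attribute per the issue 13 resolution.
function SpeechGrammarList() {
  this._items = [];
}
// Add a grammar by URI, with an optional relative weight (defaults to 1.0).
SpeechGrammarList.prototype.addFromUri = function (src, weight) {
  this._items.push({ src: src, weight: weight === undefined ? 1.0 : weight });
};
// DOM-collection style accessors: a length getter and an item(i) method.
Object.defineProperty(SpeechGrammarList.prototype, "length", {
  get: function () { return this._items.length; }
});
SpeechGrammarList.prototype.item = function (i) { return this._items[i]; };

var grammars = new SpeechGrammarList();
grammars.addFromUri("http://example.com/pizza.grxml", 0.8);
console.log(grammars.length); // 1
```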
Issue 14
14. rename language to lang
Discussion: SpeechInputRequest.language should probably be changed
to 'lang' to matchà lang attributes.
Resolution: Accepted
Issue 15
15. rename interimResults to interimResultsInterval
Discussion: SpeechInputResult.interimResults should probably be
renamed to interimResultsInterval to indicate its usage similar to
how other attributes have 'Timeout' in their names
Resolution: Turn into boolean property, name does not change
Issue 16
16. drop enum prefixes
Discussion: SPEECH_AUTHORIZATION_ prefix could be dropped for the
enums and just have 'UNKNOWN', 'AUTHORIZED' & 'NOT_AUTHORIZED'
(similar to XHR States). Same for SPEECH_INPUT_ERR_* and other such
enums.
Resolution: Accepted (given Satish's input and expertise)
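A sketch of the prefix-free constants, following the XHR/HTMLMediaElement convention Satish cites, where constants live on the interface object so identical names on different interfaces cannot clash (the interface name is taken from the draft; the specific values are assumptions):

```javascript
// Constants defined on the interface itself, so a page writes
// SpeechInputRequest.AUTHORIZED, mirroring XMLHttpRequest.DONE.
function SpeechInputRequest() {}
SpeechInputRequest.UNKNOWN = 0;
SpeechInputRequest.AUTHORIZED = 1;
SpeechInputRequest.NOT_AUTHORIZED = 2;

console.log(SpeechInputRequest.NOT_AUTHORIZED); // 2
```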
Issue 17
17. A way to uncheck automatically by speech?
Discussion: Glen: "For type checkbox, the input should be set to a
checkedness of true." It would be nice to have a way to allow user
to say something to set it to false, but I can't think of a good
convention for this other than adding an attribute or grammar.
Perhaps this could/should only be possible via scripting. (I don't
like the idea of toggling the checkbox because some users may not be
able to easily observe what state the checkbox is currently in.)
Resolution: See resolution to issue 11
<smaug> mbodell: I'm kind of online
<smaug> what enum conflicts?
<smaug> if the const is in an interface, then no
Issue 18
<inserted> [66]Issue 18
[66] http://bantha.org/~mbodell/speechxg/issuew18.html
18. Binding hints versus requirements
Discussion: Glen: "For date and time types ... type of color ...
type of range the assignment is only allowed if it is a valid ..."
On our call we discussed how these grammars are hints, and in
particular how pattern may be difficult to implement. We discussed
that showing an output response, even an invalid one, may be more
valuable than no response. Michael: We can do hints for patterns on
text, and for numbers out of range, but for other types HTML5 is jus
Resolution: See resolution to issue 11
<glen> satish provides this example of two sets of enums, with no
prefixes.
<glen> [67]https://developer.mozilla.org/en/DOM/HTMLMediaElement
[67] https://developer.mozilla.org/en/DOM/HTMLMediaElement
Issue 19
19. Does reco and TTS need to be on a server as opposed to client
side?
Discussion: Dominic: The spec for both reco and TTS now allow the
user to specify a service URL. Could you clarify what the value
would be if the developer wishes to use a local (client-side)
engine, if available? Some of the spec seems to assume a network
speech implementation, but client-side reco and TTS are very much
possible and quite desirable for applications that require extremely
low latency, like accessibility in particular. Is there any
possibility
<matt> [68]Issue 19
[68] http://bantha.org/~mbodell/speechxg/issuew19.html
<glen> satish continues: HTMLMediaElement.LOADED so no clashes
<glen> (above refers to issue 16)
Resolution: The service does not need to be remote, UAs may define
URIs to local engines. We should add clarifying text specifying
this. Also, the serviceURI does not need to be remote. We will
clarify this as well.
Issue 20
20. Set lastMark?
Discussion: Dominic: An earlier draft had the ability to set
lastMark, but now it looks like it's read-only, is that correct?
That actually may be easier to implement, because many speech
engines don't support seeking to the middle of a speech stream
without first synthesizing the whole thing. Michael: Actually the
speech xg version has never supported setting a lastMark. You can
control playback using the normal media controls (setting
currentTime, seekabl
Resolution: Leave as-is right now. Add issue note about also making
it writeable.
Issue 21
21. More frequent callbacks?
Discussion: Dominic: When I posted the initial version of the TTS
extension API on the chromium-extensions list, the primary feature
request I got from developers was the ability to get sentence, word,
and even phoneme-level callbacks, so that got added to the API
before we launched it. Having callbacks at ssml markers is great,
but many applications require synchronizing closely with the speech,
and it seems really cumbersome and wasteful to have to add an s
Resolution: Leave as-is. Suggest as enhancement to SSML.
Issue 22
22. How do we fit with capture/input/MediaStream?
Discussion: Michael: Our spec has: attribute MediaStream input;
but we have nearly no explanation of it and our examples don't show
how to use it. Can we do better?
<mbodell> spec link:
[69]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Oct
/att-0064/speechwepapi_1_.html#dfn-input
[69] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Oct/att-0064/speechwepapi_1_.html#dfn-input
Resolution: This XG probably can't do better. We should have an
issue note and include the assumption that media stream input
somehow happens. This seems to be of interest to numerous groups
(Audio, DAP, Web RTC, HTML Speech XG ...), Debbie will follow up as
part of the HCG.
<scribe> ACTION: ddahl2 to set up follow-up via HCG [recorded in
[70]http://www.w3.org/2011/11/03-htmlspeech-minutes.html#action02]
<trackbot> Created ACTION-4 - Set up follow-up via HCG [on Deborah
Dahl - due 2011-11-11].
Issue 23
23. How does speechend and related events do timing?
<matt> [71]Issue 23
[71] http://bantha.org/~mbodell/speechxg/issuew23.html
Discussion: Michael: Our spec is missing explanations around the
timing and how the information is reflected.
<scribe> Meeting: HTML Speech Incubator Group - 2011 TPAC F2F, Day 1
Resolution: define the data to reflect the source-time back into the
events. Do it on all events that accept time (including result and
speech-x). Note this timing is always relative to the "stream-time"
and real time may be faster or slower than that.
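The stream-time note in the resolution can be illustrated with a small sketch (the StreamClock helper is hypothetical, purely to show the offset bookkeeping a page would need):

```javascript
// Per the issue 23 resolution, events carry times relative to the audio
// "stream-time"; real time may run faster or slower. A page wanting
// wall-clock timestamps must track when stream time 0 played and at
// what rate the stream is being consumed.
function StreamClock(streamStartMs, playbackRate) {
  this.streamStartMs = streamStartMs; // wall-clock ms when stream time 0 played
  this.playbackRate = playbackRate;   // 1.0 = real time
}
StreamClock.prototype.toWallClock = function (streamTimeMs) {
  return this.streamStartMs + streamTimeMs / this.playbackRate;
};

var clock = new StreamClock(1000, 2.0); // stream playing at 2x real time
console.log(clock.toWallClock(500));    // stream ms 500 occurred at wall ms 1250
```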
<kaz> [ Thursday meeting adjourned ]
Received on Friday, 4 November 2011 18:31:39 UTC