- From: Dan Burnett <dburnett@voxeo.com>
- Date: Fri, 4 Nov 2011 14:13:28 -0400
- To: public-xg-htmlspeech@w3.org
Group,
The minutes from our first day are at http://www.w3.org/2011/11/03-htmlspeech-minutes.html.
For convenience, I have pasted a text version below.
-- dan
*****************************
HTML Speech Incubator Group Teleconference
03 Nov 2011
See also: [2]IRC log
[2] http://www.w3.org/2011/11/03-htmlspeech-irc
Attendees
Present
DanB, Michael, Glen, Matt, Robert, Patrick, Avery, Nagesh,
Debbie, Bertha, Milan, Rahul, DanD
Regrets
Chair
Daniel_Burnett, Michael_Bodell
Scribe
ddahl_, ddahl
Contents
* [3]Topics
1. [4]Review recently sent examples
2. [5]Robert's example
3. [6]speech-enabled email
4. [7]Milan's example of protocol
5. [8]michael johnston's multimodal use case
6. [9]Charles Hemphill's example
7. [10]Michael Bodell's example 8, translation
8. [11]Debbie's example
9. [12]another example from Charles Hemphill
10. [13]issues
11. [14]Protocol Issues
12. [15]Web API Issues
13. [16]Issue 6
14. [17]Issue 7
15. [18]Issue 8
16. [19]Issue 9
17. [20]Issue 10
18. [21]Issue 11
19. [22]Issue 12
20. [23]Issue 13
21. [24]Issue 14
22. [25]Issue 15
23. [26]Issue 16
24. [27]Issue 17
25. [28]Issue 18
26. [29]Issue 19
27. [30]Issue 20
28. [31]Issue 21
29. [32]Issue 22
30. [33]Issue 23
* [34]Summary of Action Items
_________________________________________________________
<smaug> hi
<smaug> well, who am I then o_O
<smaug> pong
<burn> trackbot, start telcon
<trackbot> Date: 03 November 2011
<Milan> ScribeNick: Milan
Review recently sent examples
<DanD>
[35]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Oct
/att-0064/speechwepapi_1_.html#introduction
[35] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Oct/att-0064/speechwepapi_1_.html#introduction
<mbodell> [36]http://bantha.org/~mbodell/speechxg/example1.html
[36] http://bantha.org/~mbodell/speechxg/example1.html
Michael: Speech Web Search Markup only
Robert: Found addGrammarFrom() is awkward
... really a hint
Glen: True that input has no grammar
Michael: It's a builtin grammar
Robert: What about deriveGrammarFrom
Glen: It's an append grammar
DanD: Option might be a better example
Michael: Text is a grammar
Robert: Assume q is an object from which a grammar can be derived
<smaug> Nit, <button name="mic" onclick="speechClick()"> is a submit
button, so when you click it, the form is submitted. type="button"
would fix the problem
DanB: addDerivedGrammar
Debbie: Figure out semantics first
Robert: AddDerivedGrammarFromID
Glen: Also rename q to 'inputField'
... Also from text input type to date or something more constrained
... Need to specify the lack of grammars
... Is this dictation?
Robert: improve example by defaulting to UTF-8
<glen> Section 5.1: when no grammar specified, defaults to
builtin:dictation
Robert: Base 64 encoding is ugly
... to the point where it is unusable
Michael: Worried about directly inserting XML due to 8th bit
DanB: Are there already common protocols for inserting strings
derived from URLs into local variables?
Glen: Should only be a W3C standard, implementation is orthogonal
Robert: AddFromString() would be nice
Glen: addStringGrammar() and addElementGrammar()
Avery: Prefer longer name because it's truer to form
<smaug> Couldn't you just prepend "data:application/srgs+xml," to
the serialized XML. But anyway, using data urls is kind of hackish,
IMO.
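A minimal sketch of smaug's suggestion (the helper name is hypothetical): wrap serialized SRGS XML in a data: URI with percent-encoding rather than base64, so the grammar stays human-readable in page source.

```javascript
// Hypothetical helper, per smaug's comment: percent-encode the
// serialized grammar instead of base64-encoding it.
function grammarToDataUri(srgsXml) {
  return "data:application/srgs+xml," + encodeURIComponent(srgsXml);
}

const srgs = '<grammar version="1.0"><rule id="yes"><item>yes</item></rule></grammar>';
const uri = grammarToDataUri(srgs);
// decodeURIComponent on the payload round-trips back to the XML
```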
Robert: Too many dots to get the interpretation
Milan: Propose addGrammarFromURI()
Robert: Newing up a speech grammar is better approach
Michael: Let's just raise issues now rather than solve them
Debbie: Example is complex, and gets mixed up with the argument that
JS is complex
* laptop?
Michael: Next example from Bjorn
Robert: The example lacks a grammar
<smaug> s/onclick="startSpeech"/onclick="startSpeech(event)"/
<DanD>
[37]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov
/att-0008/web-speech-sample-code.html
[37] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/att-0008/web-speech-sample-code.html
Robert: Need to define what happens when lacking a grammar
Avery: Is there a policy against comments in the examples?
Michael: Planning on adding examples to an appendix
Avery: It's a decent example, as long as it is clear that this
instance lacks a grammar
Robert: Example shows default behavior
Rahul: Could also delete button as means of shortening the example
<glen> per Avery's suggestion: add a comment "since no grammar is
specified and no element is bound, uses default grammar
builtin:dictation"
Rahul: Two different ways to perform same array access
Glen: Should make it consistent in example
<mbodell> In Bjorn's second example need sir.maxNBest = 2;
<glen> use same notation: s/q.value =
event.result.item(0).interpretation;/q.value =
event.result[0].interpretation;/
Robert: Intent is to get a text transcript of the user's input
... why are we accessing the interpretation instead of tokens?
Milan: Need to bring this up in protocol team
<all agreed> to use "utterance" in place of interpretation
Milan: Last two comments should apply here as well
... Should we have company-specific references?
Michael: Prefer example.org
Robert: Is there speech recognition in turn by turn?
Michael: Speech recognition is just destination capture
<smaug> Again, s/onclick="startSpeech"/onclick="startSpeech(event)"/
Robert: Prefer that speaking the next instruction cancel the last
instruction
Glen: Thought the purpose of example was to show interplay between
speech and tts?
Michael: TTS play resumes where last left off
Glen: Way to stop prior play is a good feature
... we should change this example
<glen> change example to show how to stop, by persisting the tts
object and calling stop before adding .text and .play
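Glen's suggested change could be sketched like this (the TTS object's API shape is assumed, not taken from the draft): keep one TTS object around and stop any prior playback before queuing new text, instead of letting utterances overlap.

```javascript
// Sketch only: persist the tts object, call stop() before setting
// .text and calling .play(), as glen suggests. TTSClass is a stand-in
// for whatever constructor the eventual API defines.
let currentTts = null;
function speak(text, TTSClass) {
  if (currentTts) currentTts.stop(); // cancel whatever is still playing
  currentTts = new TTSClass();
  currentTts.text = text;
  currentTts.play();
}
```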
Michael: Ollie example next
<mbodell>
[38]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov
/att-0009/htmlspeech_permission_example.html
[38] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/att-0009/htmlspeech_permission_example.html
Michael: First example is just removing unauthorized elements?
... but second example doesn't allow speech input to start
Ollie: Yes
Michael: Can you transition from not authorized to authorized?
Ollie: Should be possible, but example doesn't do that
... but could also just reload the page
* Going on break now
<inserted> scribe:ddahl_
<scribe> scribe:ddahl
Robert's example
robert: two recognitions in a row, you want to pick your cities
based on what state you're in.
<Avery> Actually I think it's based on what state is specified in
the first reco, not necessarily what state you're in. A minor nit.
robert: it really should say "interpretation.state", not just
"interpretation"
... used push instead of adding things to the array of speech
grammars
... a bug on result, should be city, also, sr.onMatch should be
sr.onResult
... second example is rereco
... gives grammars to speechInputRequest, then classifies, then does
rereco with a specific grammar
glenn: this seems to be a strange use of "interpretation"
robert: there is a huge universe of grammars
rahul: this is identifying one grammar as different from the others
robert: using the attribute "modal" to activate and deactivate
grammars
... would change the example to get interpretation.classification
... strange to have multiple "modals" as true, think modal might be
a bad idea
speech-enabled email
michael: one interesting thing is that you might get notifications
that you would want to speak to, but without clicking
robert: was mostly thinking about things like "reply", but you could
also imagine saying "read it to me" after notification
... made up a method to cancel TTS
michael: you could just delete the element
robert: what if you set up the element with stuff in it?
glenn: destroy should not be the only way to cancel
Milan's example of protocol
milan: will augment with API calls that trigger protocols
... need a result index of some kind
... then recognizer decides to change its mind and reorders results
... strange to get a "complete" result in the middle of a long
dictation
... result index 0 is the first fragment, then halfway through the
second fragment, the recognizer says the first one is done
... different from MRCP, because in MRCP that means it's the end of
it
... then retracts a result, not sure how to represent this, maybe an
"IN_PROGRESS" message with no payload
... we will put this in the larger document as an example of the
protocol
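A rough sketch of the exchange Milan describes, using the header names from the discussion; the frame syntax here is illustrative only, not an agreed format:

```
S->C: INTERMEDIATE-RESULT <request-id> IN-PROGRESS
      Result-Index: 0
      <EMMA payload: candidate for fragment 0>

S->C: INTERMEDIATE-RESULT <request-id> IN-PROGRESS
      Result-Index: 0
      Result-Status: final        ; recognizer commits fragment 0
      <EMMA payload: final fragment 0>

S->C: INTERMEDIATE-RESULT <request-id> IN-PROGRESS
      Result-Index: 1             ; no payload, no Content-Type:
                                  ; nulls out the candidate for fragment 1
```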
michael johnston's multimodal use case
<smaug> Could you please paste links to the example here
michael: "I want to go from here to there" is the use case
<smaug> ( would be then easier to read minutes later )
<mbodell> Michael's example:
[39]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov
/att-0020/multimodal_example.html
[39] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/att-0020/multimodal_example.html
<mbodell> You can walk through the examples from:
[40]http://bantha.org/~mbodell/speechxg/f2f.html which links to
[41]http://bantha.org/~mbodell/speechxg/examples.html which then
walks through the examples
[40] http://bantha.org/~mbodell/speechxg/f2f.html
[41] http://bantha.org/~mbodell/speechxg/examples.html
glenn: it would be good to have a "state" attribute
... the "nomatch" state is more of a result, not a state
... we may need more than one attribute to get results of speech
processing
michael: this also has the EMMA so that you can see the mapping from
EMMA
... this example makes use of a remote speech service
<glen>
[42]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov
/att-0020/multimodal_example.html
[42] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/att-0020/multimodal_example.html
michael: the EMMA shows the combined speech and gui input
robert: this should be a wss: , that is, a web socket protocol, but
what should we do if someone uses http?
michael: you could get the command right but not the person if you
didn't do the "clickInfo"
Charles Hemphill's example
<glen>
[43]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov
/0024.html
[43] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/0024.html
danD: we should start with the simplest example
Michael Bodell's example 8, translation
<glen> [44]http://bantha.org/~mbodell/translate.html
[44] http://bantha.org/~mbodell/translate.html
<glen> view-source:[45]http://bantha.org/~mbodell/translate.html
[45] http://bantha.org/~mbodell/translate.html
michael: different example of translation
... there's from and to languages, you choose, and then click on
microphone to talk
... there's a progress bar that gets updated
... we're grabbing our language from the selector, we're using a
dictation grammar for whatever language we're using
... where are we doing capture?
glen: wouldn't that be the microphone?
michael: not necessarily, there could be other things like media
streams
glen: is capture necessary or does it just provide more features?
michael: we didn't have any examples of capture from other places,
like from Web RTC
... right now there's no standard for accessing microphone
glen: would like to see default example where we don't have to
explicitly do capture
michael: all examples assume that there's magic for capturing audio
glen: can't we make it so that the magic is what happens by default?
dan: there are many security and privacy issues
... different permissions for getting access to media but also to do
something to the media
michael: this is also raised in some of our issues, we only have a
two sentence note now
... can TTS work on Web Sockets?
robert: yes
michael: on audio start, etc. are in our spec. another issue is that
payload of start, stop events isn't defined
robert: do we have VU meter events?
michael: no
dan: that came up in Web RTC, they don't have that, but they could
create it
michael: we do have speech-x events for custom extensions
robert: most speech apps have one
michael: is that part of the UA or the app?
Debbie's example
multi-slot filling
<mbodell> Debbie's:
[46]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov
/att-0031/Multi-slotSpeech1.html
[46] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/att-0031/Multi-slotSpeech1.html
debbie: in this example you have to pull out the slot values from
the EMMA
robert: is this the same as saying "interpretation.booking"?
debbie: not sure
... we don't know what's in "interpretation"
robert: we could get rid of "interpretation"
michael: it could be a useful pointer into the EMMA
... that is available in VXML
<mbodell> Issue: we should make sure it is clear what the
interpretation points to
<trackbot> Created ISSUE-1 - We should make sure it is clear what
the interpretation points to ; please complete additional details at
[47]http://www.w3.org/2005/Incubator/htmlspeech/track/issues/1/edit
.
[47] http://www.w3.org/2005/Incubator/htmlspeech/track/issues/1/edit
michael: should do an if to make sure that you really got a value
debbie: could add the EMMA
... would there be value in some kind of convenience syntax so that
you don't need the full DOM generality to manipulate the EMMA
result?
<mbodell> Charles' example:
[48]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov
/0033.html
[48] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/0033.html
another example from Charles Hemphill
michael: the same example as before but with an external grammar
avery: what's the advantage of having "reco" element as a child
under "input"
michael: there are two different ways to do the same thing, with
"reco" under as a "child" under <input> you don't need an id
<smaug> <input> element can't have child elements
actually, input is a child of reco in the proposal
<smaug> My comments to example 3
[49]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov
/0034.html
[49] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/0034.html
michael: another example with a real inline grammar so that you
don't have to do data uri
... we would have to define a "grammar" tag
robert: we would have to define for browsers how to interpret SRGS
avery: like putting script in page vs an external reference
<smaug> Milan: remember, we're talking about HTML here, not XML
<smaug> (I assume that was Milan)
milan: could we say "as long as this is valid XML ignore it and pass
it to us"?
robert: why wrap the whole thing with the grammar element?
michael: if there's an SRGS 1.1, you wouldn't know what version it
was, for example
... would like to have inline grammar, if any, be full SRGS with
<grammar> element
... that is the end of the examples
<Milan> * Good point Ollie
<glen> scribenick: glen
issues
burnett: if can't agree, depends on importance. If important,
capture different opinions in doc.
... (not required to resolve everything in incubator group)
<mbodell> First issue to discuss:
[50]http://bantha.org/~mbodell/speechxg/issuep1.html
[50] http://bantha.org/~mbodell/speechxg/issuep1.html
1. What Content-Type do we want to use on an empty message? Use case
was nulling out previous candidate recognition.
milan: do we have to specify? can it be assumed?
... empty means no payload?
robert: protocol doesn't require a body
... in which case I don't think it needs a content type. Example
getParams
michael: what about interim results?
... if no content and not content type then nulls out corresponding
result. Example: an interim result gets replaced with no result
(e.g. if a <cough> is initially recognized as some text)
Protocol Issues
2. I am skeptical about changing established MRCP event/method
names. I sort of agree that LISTEN is better than RECOGNIZE, but do
not think the reasons are good enough to warrant ensuing churn.
Robert: Microsoft doesn't care if similar to MRCP, rather that it's
compatible with our web sockets protocol
burnett: web sockets is just a transport
... violates many types of protocol design
... if standards track, IETF is a logical place
robert: so naming doesn't matter much at this point.
all: agree
burnett: some talk of using SIP to setup, would have to separate
signaling and data...which is one thing wrong with this.
robert: this is more to illustrate a point that it can be done
burnett: companies could implement today, and may not be completely
interoperable (as is often the case on first implementations)
michael: we agree, not to change names right now. Names will likely
to be re-evaluated in a standards track.
... minor syntax issues can be called out as a note in the doc.
burnett: when gets into a standards group, they look at requirements
and take ideas into consideration, but they consider MANY other
factors, e.g. security, that drive
3. We need a way to index the recognition results. I suggest using a
Result-Index header
all: agree to add. if a one-shot recognition, it's only [0] and
still optional
4. It was awkward to use a RECOGNITION-COMPLETE message presumably
with a COMPLETE status during continuous speech. Instead, I used
INTERMEDIATE-RESULT with a new Result-Status header set to final.
robert: just rename RECOGNITION-COMPLETE as RECOGNITION-RESULT
... it's an intermediate, unless it's a final response type.
burnett: MRCP has separate status code and completion code
Milan: we need a complete flag, not sure it was defined. We haven't
stated which status codes correspond to which messages.
burnett: in MRCP, status is about communication (like 200 OK). In
MRCP, the completion code indicates what happened (e.g. successful
reco)
robert: so status indicates "sending more", so status should be
in-progress for continuous reco case.
... need request state?
burnett: request has been made, has it been completed yet? status is
success, illegal method, illegal value, unsupported header
robert: reco result, 200 OK, in progress
5. Perhaps Source-Time should also be required on final results
all: yes, everything's fine, more to come
Milan: by time have final result, should know start time.
all: agree, require only reco result
Milan: could be reco result with type = pending
michael: pending implies have already started
robert: in progress more accurate
all: agree to leave as is
6. Wanted to confirm that channel identification is being handled by
the WebSocket container
robert: handled by web socket
... if two separate recos, then two web sockets and two audio
streams. (Can have 2 grammars active in one reco)
milan: continuous hotword case
robert: that's continuous reco
... start session with hotword and command-control grammar, all is
continuous results
michael: hard if change over time
... because have to pause to change
... so not continuous
robert: don't want to transmit audio twice, but with two sessions,
you must
avery: does the emma result specify which grammar?
michael: yes
7. I noticed that Completion-Cause was missing from Robert's spec
example in section 4.2.
robert: accidental omission, need to add
Web API Issues
1. To get the reco result I think i have to write
"e.result.item(0).interpretation". This is a lot of dots and an
index just to get the top result.
robert: I want to write e.interpretation -- because most of the time
that's what I want (but still could use the verbose way as well)
<mbodell> Here is the link to where the event is defined:
[51]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Oct
/att-0064/speechwepapi_1_.html#speechinputresult
[51] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Oct/att-0064/speechwepapi_1_.html#speechinputresult
milan: e.result.interpretation
michael: can already use e.result[0].interpretation
glen: we should change utterance to match
all: e.interpretation and e.utterance
... agreed
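The resolved shorthand could look like this in practice (object shapes assumed for illustration): e.interpretation and e.utterance are conveniences for the top result, while the verbose path through e.result remains available.

```javascript
// Mock event object, shapes hypothetical: the shorthand accessors
// simply forward to result[0], mirroring the agreed convenience.
const event = {
  result: [{ interpretation: { city: "Boston" }, utterance: "Boston" }],
  get interpretation() { return this.result[0].interpretation; },
  get utterance() { return this.result[0].utterance; }
};
```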
2. "utterance" has a couple of different meanings in the doc. It's
alternatively the recording of what the person said, or the
transcript returned by the recognizer.
michael: transcript? text? tokens?
text and token are overused and confusing
robert: but it is text, so not overloading the concept
... (unlike token)
burnett: transcript, closest to what's actually happening, laymen
get it
glen: text is not descriptive: interpretation is text, whereas
transcript vs interpretation is clear
all: agree: rename utterance to transcript
5. The "modal" attribute on SpeechGrammar is unnecessarily
restrictive
Discussion: There are cases where I'll want to have multiple
grammars active, but not all, and not just one. Developers would be
better off with a boolean enabled attribute on each grammar. Would
be useful to clarify the behavior when there is more than 1 grammar
with this set to true (only the first in the list is active?) Is
this even useful at all? What is the case for having grammars which
aren't active in the reco? Can we change the state of the modal/u
robert: less lines of code if just set one to true
milan: alternatively, could add/remove from grammars array
glen: sending all at once allows caching
... of grammars
... what about continuous case, can grammars change on the fly
michael: we decided to simplify by re-calling .start to change
grammars or anything else
milan: should have a separate way to preload
burnett: voicexml has defineGrammar
milan: grammar set object on the SpeechInputRequest
... I'm proposing sets of grammars
robert: I'd like it flatter, get rid of enabled/disabled -- just
delete -- and don't allow preload
michael: already have .open that allows preloading
<mbodell> See web api:
[52]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Oct
/att-0064/speechwepapi_1_.html#dfn-open
[52] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Oct/att-0064/speechwepapi_1_.html#dfn-open
<scribe> scribenick: glen
burnett: nervous about this, we discussed this for a long time and
considered many edge-cases
robert: alternative: get rid of modal and enable, and just use a
bunch of grammars
avery: if .open has already been called, .start doesn't call it
(.start only calls .open if hasn't been opened yet)
burnett: wondering if there are performance advantages in reco
engine if can call enable/disable as opposed to calling .open
multiple times?
<smaug> start might need to re-call open if authorizationState has
changed from not-authorized to authorized
milan: MRCP didn't solve this, why should we?
... all good MRCP clients do what you're saying automatically, they
automatically check for the deltas
robert: this runs at web-scale, distributed
... big difference between telephony and web
michael: options: eliminate modal, keep and define what happens if
multiple set to true
avery: easier to add later than remove
agree: eliminate .modal
5. The interpretation attribute is completely opaque. That may be
necessary given that SISR can pretty much return anything. But it'll
need some examples to show how to use it.
burnett: there was support for a flat array of interpretations
... I didn't like that, Nuance and their customers didn't like it,
debbie: use emma to define layout
michael: different reco engines may use emma in different ways
... fundamentally, .interpretation points to somewhere in emma,
which simplifies (and a corresponding .transcript)
all: agree, specify which part of emma holds the interpretation
michael: mapped to a DOM object, emma literal or node
... like debbie's slot-filling example
... I will send text for this
5. The array of SpeechGrammar objects is too cumbersome
<smaug> something happened to the audio. it is all just noise
<smaug> though, getting late here
Discussion: Robert: The array of SpeechGrammar objects is too
cumbersome. In most cases I'd like to write something simple like:
mySR.speechGrammars.push("USstates.grxml","majorUScities.grxml","majorInternationalCities.grxml");
But I can't. I have to new-up a separate
object for each one then add it to the array, even when I don't care
about the other attributes. Better to just make it an array of URI
strings, and add functions for the edge cases. e.g. void ena
void setWeight(in DOMString grammarUri, in float weight); And yeah,
I remember arguing the opposite on the phone call. But that's before
I tried writing sample code. Glen: "The uri of a grammar associated
with this reco. If unset, this defaults to the default builtin uri."
Presumably using the grammar attribute overwrites the default
grammar, so if a developer wishes to add a grammar that supplements
the default grammar, then this alternative should work: re
would add clarity. Michael: If you view source on the web api
document you'll see the grammar functions and descriptions are there
commented out as I anticipated, and agree, with this comment. We
should have both functions and array/collections and this makes the
things that Robert and Glen describe much easier/better.
michael: grammar spec after ? are hints, before builtin: are
required and errors if not supported
... example: builtin:contacts may recognize names in smartphone
... require builtin:generic
burnett: builtin:generic means I'll take anything you got: if it's
just a date grammar, I'll take it.
<mbodell> We are talking about
[53]http://bantha.org/~mbodell/speechxg/issuew5.html but really more
about what happens with no grammar
[53] http://bantha.org/~mbodell/speechxg/issuew5.html
milan: builtin:generic could respond with failure, builtin:dictation
could also respond with failure
robert: builtin:generic should be builtin:default
... and none specified is builtin:default
burnett: what if want to use both default and another grammar
glen: then add builtin:default and builtin:foo
michael: default is not user default, but service or ua default
milan: want a way to record without a grammar
michael: we define builtin:default, encourage vendors to implement,
and state when none specified, it's on by default (and when other
grammars specified, it can also be added).
... I like .addGrammar(url, weight) as a simplification from
creating object and then setting it
robert: .addGrammarFromUrl(url, weight)
... .addGrammarFromElement(element, weight)
.addGrammarFromString(string, weight)
... better yet: .addUrlGrammar .addElementGrammar .addStringGrammar
... but advantage for objects to be alphabetical order, grouped
together in docs
glen: .addGrammarUrl .addGrammarElement .addGrammarString
... remove is a JavaScript array operation
michael: also .addCustomParameter(name, value)
all: agree: .addGrammarUrl .addGrammarElement .addGrammarString
.addCustomParameter
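The agreed helper names could be stubbed roughly as follows; the host object's shape is hypothetical, and addGrammarElement is omitted here since it would need a DOM node.

```javascript
// Stub of the agreed helpers, shapes assumed for illustration only.
class RecoStub {
  constructor() {
    this.grammars = [];
    this.parameters = {};
  }
  addGrammarUrl(url, weight) {
    this.grammars.push({ src: url, weight: weight });
  }
  addGrammarString(srgsXml, weight) {
    // inline SRGS travels as a data: URI, per the earlier discussion
    this.addGrammarUrl("data:application/srgs+xml," +
                       encodeURIComponent(srgsXml), weight);
  }
  addCustomParameter(name, value) {
    this.parameters[name] = value;
  }
}

const sr = new RecoStub();
sr.addGrammarUrl("USstates.grxml", 1.0);
sr.addGrammarString("<grammar/>", 0.5);
sr.addCustomParameter("x-vendor-mode", "fast"); // parameter name made up
```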
<smaug> I think this is enough for me. I'll read the minutes
tomorrow and send comments
<smaug> It is midnight here
<smaug> dark? it has been dark here for the last 6 hours
<rahul> scribenick: rahul
Issue 6
<mbodell> Link to the current issue:
[54]http://bantha.org/~mbodell/speechxg/issuew6.html
[54] http://bantha.org/~mbodell/speechxg/issuew6.html
<glen> 6. The names are a bit long.
<glen> Discussion: e.g. "new SpeechInputRequest()" vs "new
SpeechIn()" . e.g. "mySR.speechGrammars.push("foo")" vs
"mySR.grammars.push("foo")" . e.g. "resultEMMAXML" vs "EMMAXML" or
just "EMMA" (call the other one "EMMAText" ) e.g. "inputWaveformURI"
vs "inputURI"
Milan: how about SpeechRequest instead of SpeechInputRequest?
Robert: SpeechRecognizer?
Milan: AudioSynthesizer?
Glen: SpeechReco?
<Milan> Milan: AudioSynth
Resolution: We will use SpeechReco instead of SpeechInputRequest
<matt> [55]Parkinson's Law of Triviality
[55] http://en.wikipedia.org/wiki/Bikeshedding
<scribe> ACTION: Editing team to update to SpeechReco [recorded in
[56]http://www.w3.org/2011/11/03-htmlspeech-minutes.html#action01]
<trackbot> Sorry, couldn't find user - Editing
Issue 7
7. SpeechInputRequest.outputToElement() should be an attribute,
perhaps 'forElement'
<matt> [57]Issue 7
[57] http://bantha.org/~mbodell/speechxg/issuew7.html
Resolution: Replace outputToElement() function with the
outputElement attribute
Issue 8
<inserted> [58]Issue 8
[58] http://bantha.org/~mbodell/speechxg/issuew8.html
8. SpeechInputResult has a getter "item(index)".
SpeechInputResultEvent has an array "SpeechInputResult[] results".
Discussion: Can we change both to be collections similar to
[59]http://www.w3.org/TR/FileAPI/#dfn-filelist (accessible via []
operator and optionally with a .item() method)?
[59] http://www.w3.org/TR/FileAPI/#dfn-filelist
Resolution: Accepted
Issue 9
<matt> [60]Issue 9
[60] http://bantha.org/~mbodell/speechxg/issuew9.html
9. The <reco> element should probably be a void element with no
content on its own
Discussion: Satish:
[61]http://dev.w3.org/html5/spec/Overview.html#void-elements. I just
noticed this in the for attribute's description, missed it in
earlier reads: "If the for attribute is not specified, but the reco
element has a recoable element descendant, then the first such
descendant in tree order is the reco element's reco control." Is
there a benefit to doing this over requiring the 'for' attribute to
be set and making reco a void element? Charles: I a
[61] http://dev.w3.org/html5/spec/Overview.html#void-elements.
<glen> resolution: can specify with either descendant or with for=
attribute
Resolution: Agreed to leave it as-is using either the for or the
descendant pattern
Issue 10
<inserted> [62]Issue 10
[62] http://bantha.org/~mbodell/speechxg/issuew10.html
10. TTS is hard
Discussion: Bjorn: I can't see any easy way to do programmatic TTS.
The TTS element is at least missing the attributes @text and @lang.
Without those, it's pretty hard to do the very simple use case of
generating a string and speaking it. It's possible, but you need to
build a whole SSML document. For use cases, see the samples I sent
earlier today. Dominic: For TTS, I don't understand where the
content to be spoken is supposed to go if it's not specified in
Michael: @lang is not missing since it could be inherited
... there is no @text there
Glen: content within <tts></tts> will show up within older browsers
<mbodell> Discussion is <tts src="data:text/plain,Hello, world"/>
versus <tts value="Hello, world"/> versus something else. Note in JS
we could define a function so it is pretty similar, but from Markup
a little harder to get the function creating the data uri (probably
still possible)
<glen> 72<tts value="fahrenheit">F</tts>
<glen> michael: tts as a markup may render visually a control (play,
stop, etc)
<glen> ...other dom can interact
<glen> glen: most uses of tts need dynamic control -- that is
require javascript
<glen> michael: because tts inherits from media-element, it requires
a src attribute
<glen> glen: <img alt="text">
<glen> michael: <tts> is not used as an alternative fallback
Dan: usecase for <tts> element is to facilitate easy generation as
part of markup rather than generating script
Michael: the @lang inherited from the <media> element should be
passed as a parameter to the synthesizer
Resolution: Add a @text attribute to <tts>.
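The workaround being weighed against @text can be seen by building the src URI by hand; the helper name below is hypothetical.

```javascript
// Building the data: URI that <tts src="..."> would need in order to
// speak a generated string without a @text attribute.
function ttsSrcFor(text) {
  return "data:text/plain," + encodeURIComponent(text);
}

const src = ttsSrcFor("Hello, world");
// → "data:text/plain,Hello%2C%20world"
```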
Issue 11
-> [65]http://bantha.org/~mbodell/speechxg/issuew11.html Issue 11
[65] http://bantha.org/~mbodell/speechxg/issuew11.html
11. How does binding to button work
Discussion: Satish: "When the recoable element is a button then if
the button is not disabled, then the result of a speech recognition
is to activate the button." "For button controls (submit, image,
reset, button) the act of recognition just activates the input."
"For type checkbox, the input should be set to a checkedness of
true. For type radiobutton, the input should be set to a checkedness
of true, and all other inputs in the radio button group must b
Michael: propose to have an issue note that this needs further
thought
Robert: define what we can, and for others say there is no binding
Resolution: Add issue note that more work to be done on bindings
Issue 12
12. What about meter, progress, and output elements?
Discussion: Satish: The meter, progress and output elements all seem
to be aimed at displaying results and not for taking user input. Is
there a reason why these are included as recoable elements?
Michael: This is specified at Reco Bindings. A person could want to
be able to speak and have it change a progress bar or meter or
output element. The primary reason is matching what is done with
label. These are all labelable elements and thus ended up as recoable
Glen: suggest we not talk about bindings to these
Dan: we need to decide which ones to leave out, I agree since these
are not even <input> elements
Resolution: Remove these from the recoable elements and bindings
Issue 13
13. grammars and parameters should be collections
Discussion: Satish: Similar to issue 8, SpeechInputRequest
attributes 'grammars' and 'parameters' should probably be turned
into a collection as well
Resolution: Accepted
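A minimal sketch of what the accepted collection shape could look like (the names SpeechGrammarList, addFromUri, and item are assumptions for illustration, not agreed spec text):

```javascript
// Hypothetical collection type for the 'grammars' attribute, replacing a
// single-valued attribute per the issue 13 resolution.
function SpeechGrammarList() {
  this._items = [];
}
// Add a grammar by URI, with an optional relative weight (defaults to 1.0).
SpeechGrammarList.prototype.addFromUri = function (src, weight) {
  this._items.push({ src: src, weight: weight === undefined ? 1.0 : weight });
};
// DOM-collection style accessors: a length getter and an item(i) method.
Object.defineProperty(SpeechGrammarList.prototype, "length", {
  get: function () { return this._items.length; }
});
SpeechGrammarList.prototype.item = function (i) { return this._items[i]; };

var grammars = new SpeechGrammarList();
grammars.addFromUri("http://example.com/pizza.grxml", 0.8);
console.log(grammars.length); // 1
```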
Issue 14
14. rename language to lang
Discussion: SpeechInputRequest.language should probably be changed
to 'lang' to matchà lang attributes.
Resolution: Accepted
Issue 15
15. rename interimResults to interimResultsInterval
Discussion: SpeechInputResult.interimResults should probably be
renamed to interimResultsInterval to indicate its usage similar to
how other attributes have 'Timeout' in their names
Resolution: Turn into boolean property, name does not change
Issue 16
16. drop enum prefixes
Discussion: SPEECH_AUTHORIZATION_ prefix could be dropped for the
enums and just have 'UNKNOWN', 'AUTHORIZED' & 'NOT_AUTHORIZED'
(similar to XHR States). Same for SPEECH_INPUT_ERR_* and other such
enums.
Resolution: Accepted (given Satish's input and expertise)
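A sketch of the prefix-free constants, following the XHR/HTMLMediaElement convention Satish cites, where constants live on the interface object so identical names on different interfaces cannot clash (the interface name is taken from the draft; the specific values are assumptions):

```javascript
// Constants defined on the interface itself, so a page writes
// SpeechInputRequest.AUTHORIZED, mirroring XMLHttpRequest.DONE.
function SpeechInputRequest() {}
SpeechInputRequest.UNKNOWN = 0;
SpeechInputRequest.AUTHORIZED = 1;
SpeechInputRequest.NOT_AUTHORIZED = 2;

console.log(SpeechInputRequest.NOT_AUTHORIZED); // 2
```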
Issue 17
17. A way to uncheck automatically by speech?
Discussion: Glen: "For type checkbox, the input should be set to a
checkedness of true." It would be nice to have a way to allow user
to say something to set it to false, but I can't think of a good
convention for this other than adding an attribute or grammar.
Perhaps this could/should only be possible via scripting. (I don't
like the idea of toggling the checkbox because some users may not be
able to easily observe what state the checkbox is currently in.)
Resolution: See resolution to issue 11
<smaug> mbodell: I'm kind of online
<smaug> what enum conflicts?
<smaug> if the const is in an interface, then no
Issue 18
<inserted> [66]Issue 18
[66] http://bantha.org/~mbodell/speechxg/issuew18.html
18. Binding hints versus requirements
Discussion: Glen: "For date and time types ... type of color ...
type of range the assignment is only allowed if it is a valid ..."
On our call we discussed how these grammars are hints, and in
particular how pattern may be difficult to implement. We discussed
that showing an output response, even an invalid one, may be more
valuable than no response. Michael: We can do hints for patterns on
text, and for numbers out of range, but for other types HTML5 is jus
Resolution: See resolution to issue 11
<glen> satish provides this example of two sets of enums, with no
prefixes.
<glen> [67]https://developer.mozilla.org/en/DOM/HTMLMediaElement
[67] https://developer.mozilla.org/en/DOM/HTMLMediaElement
Issue 19
19. Does reco and TTS need to be on a server as opposed to client
side?
Discussion: Dominic: The spec for both reco and TTS now allow the
user to specify a service URL. Could you clarify what the value
would be if the developer wishes to use a local (client-side)
engine, if available? Some of the spec seems to assume a network
speech implementation, but client-side reco and TTS are very much
possible and quite desirable for applications that require extremely
low latency, like accessibility in particular. Is there any
possibility
<matt> [68]Issue 19
[68] http://bantha.org/~mbodell/speechxg/issuew19.html
<glen> satish continues: HTMLMediaElement.LOADED so no clashes
<glen> (above refers to issue 16)
Resolution: The service does not need to be remote, UAs may define
URIs to local engines. We should add clarifying text specifying
this. Also, the serviceURI does not need to be remote. We will
clarify this as well.
Issue 20
20. Set lastMark?
Discussion: Dominic: An earlier draft had the ability to set
lastMark, but now it looks like it's read-only, is that correct?
That actually may be easier to implement, because many speech
engines don't support seeking to the middle of a speech stream
without first synthesizing the whole thing. Michael: Actually the
speech xg version has never supported setting a lastMark. You can
control playback using the normal media controls (setting
currentTime, seekabl
Resolution: Leave as-is right now. Add issue note about also making
it writeable.
Issue 21
21. More frequent callbacks?
Discussion: Dominic: When I posted the initial version of the TTS
extension API on the chromium-extensions list, the primary feature
request I got from developers was the ability to get sentence, word,
and even phoneme-level callbacks, so that got added to the API
before we launched it. Having callbacks at ssml markers is great,
but many applications require synchronizing closely with the speech,
and it seems really cumbersome and wasteful to have to add an s
Resolution: Leave as-is. Suggest as enhancement to SSML.
Issue 22
22. How do we fit with capture/input/MediaStream?
Discussion: Michael: Our spec has: attribute MediaStream input;
but we have nearly no explanation of it and our examples don't show
how to use it. Can we do better?
<mbodell> spec link:
[69]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Oct
/att-0064/speechwepapi_1_.html#dfn-input
[69] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Oct/att-0064/speechwepapi_1_.html#dfn-input
Resolution: This XG probably can't do better. We should have an
issue note and include the assumption that media stream input
somehow happens. This seems to be of interest to numerous groups
(Audio, DAP, Web RTC, HTML Speech XG ...), Debbie will follow up as
part of the HCG.
<scribe> ACTION: ddahl2 to set up follow-up via HCG [recorded in
[70]http://www.w3.org/2011/11/03-htmlspeech-minutes.html#action02]
<trackbot> Created ACTION-4 - Set up follow-up via HCG [on Deborah
Dahl - due 2011-11-11].
Issue 23
23. How does speechend and related events do timing?
<matt> [71]Issue 23
[71] http://bantha.org/~mbodell/speechxg/issuew23.html
Discussion: Michael: Our spec is missing explanations around the
timing and how the information is reflected.
<scribe> Meeting: HTML Speech Incubator Group - 2011 TPAC F2F, Day 1
Resolution: define the data to reflect the source-time back into the
events. Do it on all events that accept time (including result and
speech-x). Note this timing is always relative to the "stream-time"
and real time may be faster or slower than that.
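The stream-time note in the resolution can be illustrated with a small sketch (the StreamClock helper is hypothetical, purely to show the offset bookkeeping a page would need):

```javascript
// Per the issue 23 resolution, events carry times relative to the audio
// "stream-time"; real time may run faster or slower. A page wanting
// wall-clock timestamps must track when stream time 0 played and at
// what rate the stream is being consumed.
function StreamClock(streamStartMs, playbackRate) {
  this.streamStartMs = streamStartMs; // wall-clock ms when stream time 0 played
  this.playbackRate = playbackRate;   // 1.0 = real time
}
StreamClock.prototype.toWallClock = function (streamTimeMs) {
  return this.streamStartMs + streamTimeMs / this.playbackRate;
};

var clock = new StreamClock(1000, 2.0); // stream playing at 2x real time
console.log(clock.toWallClock(500));    // stream ms 500 occurred at wall ms 1250
```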
<kaz> [ Thursday meeting adjourned ]
Received on Friday, 4 November 2011 18:31:39 UTC