
[minutes] 1 September 2011

From: Dan Burnett <dburnett@voxeo.com>
Date: Thu, 1 Sep 2011 15:08:40 -0400
Message-Id: <7AC5DBDF-23BC-4CFB-8E7F-9CB7C4338745@voxeo.com>
To: public-xg-htmlspeech@w3.org
Group,

The minutes from today's call are available at http://www.w3.org/2011/09/01-htmlspeech-minutes.html.

For convenience, a text version is embedded below.

Thanks to Glen Shires for taking the minutes.

-- dan

**********************************************************************************

            HTML Speech Incubator Group Teleconference


01 Sep 2011

   [2]Agenda

      [2] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Aug/0038.html

   See also: [3]IRC log

      [3] http://www.w3.org/2011/09/01-htmlspeech-irc

Attendees

   Present
          Dan_Burnett, Olli_Pettay, Milan_Young, Debbie_Dahl,
          Glen_Shires, Dan_Druta, Charles_Hemphill, Michael_Bodell

   Regrets
          Robert_Brown

   Chair
          Dan_Burnett

   Scribe
          Glen_Shires

Contents

     * [4]Topics
         1. [5]Topics remaining to be discussed
         2. [6]Is audio recording without recognition a scenario to
            support?
         3. [7]Preloading of resources
         4. [8]Feedback mechanism for continuous recognition
         5. [9]Extending our group's charter
         6. [10]Charter Extension Status
         7. [11]Whether nomatch, noinput are errors or other conditions
         8. [12]How are top-level weights on grammars interpreted?
     * [13]Summary of Action Items
     _________________________________________________________


Topics remaining to be discussed

   [15]http://www.w3.org/2005/Incubator/htmlspeech/live/NOTE-htmlspeech
   -20110629.html#topics

     [15] http://www.w3.org/2005/Incubator/htmlspeech/live/NOTE-htmlspeech-20110629.html#topics

Is audio recording without recognition a scenario to support?

   burn: other APIs cover this
   ... in our charter: record audio that's recognized; but fine if we
   specify that we don't specifically capture audio without recognition

   milan: so could recognize & capture audio, and ignore the reco
   results -- but this may require a license

   debbie: design decision 85: already decided we don't just capture
   audio

Preloading of resources

   <Zakim> ddahl, you wanted to point out that we already have DD 85

   milan: we may not need an explicit API, but some things may preload
   implicitly - not sure exactly how this would be implemented
   ... I believe preload is necessary sometimes, and is in scope, and
   we need notification when complete

   olli: agree

   burn: in voicexml, author makes hint that grammar unlikely to change
   before being used (platform may use or ignore this hint)

   olli: need to know when loading is complete, so recognition button
   does something [quickly]

   burn: but that's in direct conflict with the "hint" concept. An
   author who has a changing grammar would prefer that the most
   up-to-date grammar always be used, even if adds time delay.

   olli: use cases for both

   burn: I agree. Do we need anything explicitly in the API for this?
   This is not an optimization, it's a user-affecting behavior that
   author may wish to specify

   milan: I don't think we need it, but if others feel strongly, I
   don't object

   olli: API considerations mean an event back to indicate preload is
   complete

   burn: summarizing, may want to know all grammars are loaded before
   display a "recognize" button. So author may need to request
   preloading and get a notification back.

   michael: I understand, but think of it more as "prepare grammars"
   rather than "preload"

   burn: so if get an event back, author can determine how to handle
   the event
   ... so API must support author requesting "prepare grammars" and
   getting an event back.
   ... indicating completion
   ... we agree with this as a design decision
   ... does it apply to anything besides grammars?
   ... voicexml has a TTS - fetch audio (in some cases, it may not be
   available)
   ... pre-recorded or streamed
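
   [Editorial sketch] The "prepare grammars and get an event back"
   pattern agreed above can be modeled as follows. All names here
   (SpeechRecognizerStub, addGrammar, prepare) are illustrative
   assumptions, not API the group has defined; the loading is faked
   with a timer to stand in for fetching and compiling a grammar.

```javascript
// Hypothetical sketch of the "prepare grammars" design decision.
// Method and class names are invented; loading is simulated.
class SpeechRecognizerStub {
  constructor() { this.grammars = []; this.ready = false; }
  addGrammar(src, weight = 1.0) {
    this.grammars.push({ src, weight, loaded: false });
  }
  // Author requests preparation of all grammars; the returned promise
  // plays the role of the completion event discussed in the minutes.
  prepare() {
    const loads = this.grammars.map(g =>
      // Stand-in for a real fetch + grammar compile.
      new Promise(res => setTimeout(() => { g.loaded = true; res(); }, 10))
    );
    return Promise.all(loads).then(() => { this.ready = true; });
  }
}

const reco = new SpeechRecognizerStub();
reco.addGrammar("http://example.com/commands.grxml");
reco.addGrammar("http://example.com/digits.grxml");
reco.prepare().then(() => {
  // Only now would the page un-ghost its "recognize" button.
  console.log("all grammars loaded:", reco.grammars.every(g => g.loaded));
});
```

   The point of the promise/event is exactly Olli's requirement: the
   author can keep the recognition button disabled until it fires.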

   glen: could have different voices or languages to preload

   burn: yes, but seems different to me, they don't change as
   dynamically as grammars

   olli: author needs to know everything (system) is loaded before
   initially beginning

   burn: comparable to streaming video or audio - buttons for playing
   are ghosted out until stream/resource is ready

   charles: recognizer may be local and may need to wait for models/etc
   to load

   burn: practically I'm trying to understand what differs here from
   voicexml
   ... local vs server is not clear-cut: sometimes files in mixed
   locations

   michael: we are all remote

   milan: nuance all local

   burn: I'm swayed less by infrastructure details than user-affecting
   details
   ... tradition in graphical web world is that buttons only visible
   when corresponding resources are available

   michael: user agent could buffer if reco/grammars/etc not ready

   burn: what about TTS

   olli: what if server down

   michael: sometimes web interface is not that, instead click "play"
   and wait to download, or find that it's not available

   burn: agree, users are accustomed to audio not playing immediately
   ... it's a significant task to know that everything is completely
   ready: grammars, recognizer, audio files, etc

   olli: do we have an event for recognizer starting?

   burn: olli, is there a way today in HTML to know if an audio file
   exists?

   michael: it's only a hint, may be wrong

   <burn> in answer to olli's question about recognizer starting,
   Charles said yes

   michael: in HTML5 there is a buffered attribute to query to see how
   much is in the buffer, but you can't tell it to buffer and
   user-agents can discard the buffer, so it's all heuristics
   ... not enforced

   burn: trying to remember, do we have a way to specify playing an
   audio file, how close is our current spec to HTML?

   michael: I think close, because we inherit from media.

   burn: Is there any need (DD)

   <mbodell> we inherit from HTMLMediaElement which has the attributes

   michael: properties like preload and buffer useful for synthesis to
   inherit from

   burn: any other resources to preload, or any general statement on
   preloading?
   ... in VoiceXML, first-call behavior: the first time a page is
   called, it may not be ready, but assuming the resources are not
   changing dramatically, everything is loaded the second time.
   ... We (Voxeo) and other vendors recommend to customers to
   "run-once" (automated or not) to get all loaded on first call.
   ... Web browser different, but if for example, at a conference, you
   preload videos so they play quickly (e.g. start playing and pause).
   ... I don't know of any equivalent for having a recognizer be ready.
   ... I'm not proposing any particular solution here. Anyone want to
   add anything else?

   michael: grammars is the most expensive thing related to
   recognizers. Input is more forgiving than output because can buffer
   and then catch-up.

   burn: so it's a performance issue, not a UI issue.

Feedback mechanism for continuous recognition

   burn: DD 74
   ... replace mechanism not for user feedback, but rather server to
   client

   milan: a final result is final - nobody was motivated to spec all
   this out in protocol discussions

   burn: how motivated is group to define a feedback mechanism?

   michael: reco correcting itself

   burn: to me, reco correcting itself is feedforward. I'm asking if we
   need a way for client to inform server that something was wrong.

   milan: could also be done as vendor params.

   michael: if we can standardize, makes sense. Google proposed and
   Microsoft interested.

   milan: needs to be a hint to recognizer, not a requirement for
   recognizer to do anything

   michael: agree
   ... won't require changing recognizer results

   burn: final means final

   milan: final unless we have this feedback - but I'm reluctant to
   open this can of worms

   burn: what if recognizer has not reached a final state, but client
   provides feedback, then as long as recognizer has not made it final,
   it can change.

   milan: not common case, users can't change that fast.

   michael: not necessarily
   ... it's a hint. recognizer can do with it what it needs to.

   burn: client to recognizer feedback mechanism is a hint --
   recognizer can do with hint whatever it needs to. Final is still
   final, so can't change past finalized results.

   glen: agree, a hint for recognizer

   milan: agree, a hint

   burn: DD must be a way for client to send feedback about a
   recognition to the recognizer, even while reco is ongoing
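
   [Editorial sketch] The decision just recorded - feedback is a hint,
   and final is final - can be illustrated with the toy model below.
   The class and method names (ContinuousRecoSession, sendFeedback) are
   invented for illustration; no such interface was specified.

```javascript
// Model of the feedback design decision: the client may send feedback
// at any time, the recognizer MAY act on it for interim results, but
// results already marked final can never change.
class ContinuousRecoSession {
  constructor() { this.results = []; } // each: { text, final }
  addResult(text, final) { this.results.push({ text, final }); }
  // Client -> recognizer feedback about result i; a hint only.
  sendFeedback(i, correctedText) {
    const r = this.results[i];
    if (!r || r.final) return false;   // final means final: hint ignored
    r.text = correctedText;            // recognizer MAY revise interim result
    return true;
  }
}

const s = new ContinuousRecoSession();
s.addResult("recognize speech", true);   // finalized
s.addResult("a beach", false);           // still interim
console.log(s.sendFeedback(0, "wreck a nice beach")); // false: final is final
console.log(s.sendFeedback(1, "the beach"));          // true: interim may change
```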

   <mbodell>
   [16]http://www.w3.org/2005/Incubator/htmlspeech/2011/05/f2fminutes20
   1105.html#continuous2

     [16] http://www.w3.org/2005/Incubator/htmlspeech/2011/05/f2fminutes201105.html#continuous2

   burn: also, I believe we agree that there is a point at which a
   result is final and can't be changed. I'm trying to find the DD for
   that.

   michael: I don't think there was a DD on that. As long as continuous
   reco is ongoing, results can change.

   milan: but sending only interim results requires longer and longer
   results to be returned for long continuous recognition.

   glen: could implement so that interim results are "semi-final" and
   thus don't have to re-send entire result each time, but still not
   "final" so that can change if necessary.
   ... so the question here is whether we want to add this complexity
   to the spec.
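
   [Editorial sketch] Glen's "semi-final" idea could work as below:
   segments the recognizer has marked stable are never re-sent, and
   each update replaces only the still-changing tail. The function and
   data shapes are assumptions for illustration; nothing here was
   decided.

```javascript
// Client-side view of "semi-final" interim results: the transcript is
// stable segments (never re-sent) plus an interim tail (replaced
// wholesale on each update), so updates stay small on long sessions.
function mergeInterim(stableSegments, interimTail) {
  return stableSegments.concat(interimTail).join(" ");
}

let stable = [];
let tail = ["hello"];
console.log(mergeInterim(stable, tail));   // "hello"

// Next update: "hello" becomes semi-final; only the tail is re-sent.
stable = ["hello"];
tail = ["world how", "are you"];
console.log(mergeInterim(stable, tail));   // "hello world how are you"
```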

   michael: agree, we did discuss, but not make a decision on this at
   face to face.

   burn: we need to discuss this further on mailing list or in future
   call.

Extending our group's charter

Charter Extension Status

   burn: I spoke with Coralie Mercier and set to go. Not clear what we
   are using TPAC discussion for. Charter officially extended to end of
   November.
   ... However, the expectation is that the group will wrap up its work
   before TPAC and publish right after TPAC.
   ... tech discussions in Sept, Oct for editorial and wrap-up. Publish
   right after TPAC. Can publish before end of November.
   ... I submitted a paragraph on our accomplishments: DD, web api,
   html extensions and protocol, we plan to complete and wrap-up in a
   report.
   ... she is expecting to publish this paragraph this week. She
   reassured us that our charter is intact and this is a formality.

Whether nomatch, noinput are errors or other conditions

   michael: we discussed and decided to make not errors

   burn: let's capture as DD if we don't have one...which we apparently
   don't. So we'll record this as DD.

How are top-level weights on grammars interpreted?

   michael: have in API ability to add weights, but haven't defined
   what they mean

   burn: can anyone propose something?

   milan: in voicexml, this is vendor specific

   burn: I'm fine with not defining

   <mbodell> A weight is nominally a multiplying factor in the
   likelihood domain of a speech recognition search. A weight of "1.0"
   is equivalent to providing no weight at all. A weight greater than
   "1.0" positively biases the grammar and a weight less than "1.0"
   negatively biases the grammar. If unspecified, the default weight
   for any grammar is "1.0". If no weight is specified for any grammar
   element then all grammars are equally likely.

   <mbodell> Effective weights are usually obtained by study of real
   speech and textual data on a particular platform. Furthermore, a
   grammar weight is platform specific. Note that different ASR engines
   may treat the same weight value differently. Therefore, the weight
   value that works well on particular platform may generate different
   results on other platforms.
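
   [Editorial sketch] The VoiceXML semantics quoted above - a weight is
   a multiplying factor in the likelihood domain, 1.0 neutral - can be
   illustrated with a toy re-scoring function. Real engines apply
   weights inside the recognition search; this sketch, with invented
   names and data, only rescales final scores.

```javascript
// Toy illustration of grammar weights as likelihood multipliers:
// weight 1.0 is neutral, > 1.0 biases a grammar up, < 1.0 biases it down.
function applyGrammarWeights(hypotheses) {
  // hypotheses: [{ text, likelihood, grammarWeight }]
  return hypotheses
    .map(h => ({ ...h, score: h.likelihood * h.grammarWeight }))
    .sort((a, b) => b.score - a.score);
}

const ranked = applyGrammarWeights([
  { text: "call home", likelihood: 0.40, grammarWeight: 1.0 },
  { text: "close tab", likelihood: 0.38, grammarWeight: 1.5 }, // biased up
  { text: "cold foam", likelihood: 0.45, grammarWeight: 0.5 }, // biased down
]);
console.log(ranked.map(h => h.text));
// → [ 'close tab', 'call home', 'cold foam' ]
```

   As the quoted text warns, the same weight values may rank hypotheses
   differently on different platforms.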

   debbie: api section 7.1 says ...
   ... "relative to", but hard to interpret what that means

   <mbodell> The posted text was VXML

   <mbodell> the next text is from our current api spec, 7.1 that
   Debbie mentioned

   <mbodell> This method adds a grammar to the set of active grammars.
   The URI for the grammar is specified by the src parameter, which
   represents the URI for the grammar. If the weight parameter is
   present it represents this grammar's weight relative to the other
   grammar. If the weight parameter is not present, the default value
   of 1.0 is used. If the modal parameter is set to true, then all
   other already active grammars are disabled. If the modal parameter
   is not pr

   burn: let's distinguish between general statements about weights,
   and weights relative to each other. We've always agreed that larger
   means greater weight. But we've never stated what values mean.
   ... not probabilities.

   michael: yes, 2 is not necessarily twice as much as 1

   <Charles> SRGS weight discussion:
   [17]http://www.w3.org/TR/speech-grammar/#S2.4.1

     [17] http://www.w3.org/TR/speech-grammar/#S2.4.1

   burn: two grammars of weight X both have the same weighting,
   whatever that means
   ... if one grammar A has weight X and grammar B has weight Y, and X
   > Y, then grammar A has greater weight than grammar B

   <mbodell> I'm not sure if we want X > Y then X is greater than
   versus greater than or equal to

   michael: should that be greater than, or greater than or equal to.
   Might be a step function. 1.8 and 1.9 might be treated as the same.
   Equal or Greater (but not less).
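
   [Editorial sketch] Michael's step-function point: an engine may
   quantize weights, so the mapping from author-supplied weight to
   effective bias need only be monotonically non-decreasing, and 1.8
   and 1.9 can legally be treated the same. The bucket width below is
   an arbitrary assumption for illustration.

```javascript
// A step function over weights: non-decreasing, but not strictly
// increasing - nearby weights can land in the same bucket.
function quantizeWeight(w) {
  // Round down to 0.5-wide buckets (bucket width chosen arbitrarily).
  return Math.floor(w * 2) / 2;
}

console.log(quantizeWeight(1.8) === quantizeWeight(1.9)); // true: same bucket
console.log(quantizeWeight(1.0) <= quantizeWeight(2.0));  // true: never decreases
```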

   burn: "monotonically non-decreasing" is how we described it

   michael: yes

   burn: in the SSML sense
   ... (I don't know that SRGS says that)

   michael: yes, SRGS only says positively and negatively biasing

   burn: DD "monotonically non-decreasing"
   ... we're out of time. Thanks, bye
Received on Thursday, 1 September 2011 19:09:20 GMT
