Re: Notes from today's protocol call

I agree with Robert that rather than strictly adhering to MRCP2, we should be
taking elements of it as inspiration. What we are implementing is already not
MRCP2: many of MRCP2's methods are left out, and our audio is transported over
the same connection as the control messages rather than over a separate RTP
stream. MRCP2 was designed for a very different purpose and should not limit
what we do in this work.

A more general point: in designing a standard to support the addition of speech
to graphical interaction, we should make sure we can support at the very least
the most basic kinds of multimodal interaction. For example, we need a way to
pass other input streams, such as button presses or ink, along with speech input
to a speech service. How would we achieve that with the current protocol proposal?

On Jun 29, 2011, at 1:30 PM, Robert Brown wrote:


inline...
________________________________
From: Young, Milan [Milan.Young@nuance.com]

Inline…

________________________________
From: Robert Brown [mailto:Robert.Brown@microsoft.com]
Sent: Thursday, June 23, 2011 5:04 PM
To: Young, Milan; HTML Speech XG
Subject: RE: Notes from today's protocol call

One other thing we’ll need to consider is how to add/remove grammars during continuous recognition.

Some use cases:
•         In dictation, it’s not uncommon to have hot words that switch in and out of a command mode (i.e. enable/disable a command grammar).
[Milan] I was figuring this could take place by enabling more than one recognition session.  You have the main dictation session going, and then one or more parallel channels performing hotword recognition.
[Robert] That could work. But it could also be messy. For example, if you got a match on both sessions, the app would then need to decide which one to act on. The recognizer would be the better resource to make that decision.



•         In open-mic multimodal apps, the app will listen continuously, but change the set of active grammars based on the user’s other non-speech interactions with the app.
[Milan] Yes, I’ve thought about this scenario as well.  The problem is that it’s such a divergence from MRCP2 that it would be hard to retrofit onto existing stacks.  I also haven’t heard much discussion of this feature at the API level, but perhaps I’m wrong.
[Robert] Divergence is okay. Existing stacks weren't designed for these scenarios, but we shouldn't let that invalidate the scenario. The recognizer state machine will need to be more complex than it is in MRCP. Moreover, any existing recognizers that are capable of continuous recognition by definition don't currently use MRCP anyway because it doesn't support the scenario.

[Robert] How about we do something like the following?
1. redefine SET-GRAMMAR so that it can also be used during a recognition, and so that it returns a handle that can be used to refer to the grammar later
2. add a method to set grammar state.  For example, call it "SET-GRAMMAR-STATE", and give it the following capabilities:
    a. enable/disable named top-level rules within a grammar
    b. enable/disable entire grammars
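As a sketch of how that proposal might look on the wire, again in the protocol's MRCP2-like text style (the handle syntax, header names, status lines, and rule names below are all invented for illustration; only the method names SET-GRAMMAR and SET-GRAMMAR-STATE come from the proposal above):

```
C->S: SET-GRAMMAR 10001
      Channel-Identifier: 23af1e13@speechrecog
      Content-Type: application/srgs+xml

      <grammar root="commands" ...> ... </grammar>

S->C: 10001 200 COMPLETE
      Grammar-Handle: g42          ; handle returned for later reference

C->S: SET-GRAMMAR-STATE 10002     ; sent mid-recognition
      Grammar-Handle: g42
      Rule-Name: delete-all
      Rule-State: disabled         ; capability (a): disable one top-level rule

C->S: SET-GRAMMAR-STATE 10003
      Grammar-Handle: g42
      Grammar-State: disabled      ; capability (b): disable the entire grammar
```

This keeps grammar activation inside the recognizer, which addresses the earlier point that the recognizer, not the app, is the better place to arbitrate between competing matches.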

Received on Wednesday, 29 June 2011 23:49:42 UTC