Re: R23. Speech as an input on any application should be able to be optional from chan@info-cast.com on 2010-12-06 (public-xg-htmlspeech@w3.org from December 2010)

From: <chan@info-cast.com>
Date: Sun, 05 Dec 2010 19:15:00 -0700
To: <Olli@pettay.fi>
Cc: Deborah Dahl <dahl@conversational-technologies.com>, Bjorn Bringert <bringert@google.com>, Dan Burnett <dburnett@voxeo.com>, <public-xg-htmlspeech@w3.org>
Message-ID: <0d0bfe89980abf9db0cee6c505467eda@info-cast.com>
Hello Olli,

My responses are embedded below.

Thanks

Chan

On Mon, 29 Nov 2010 22:34:20 +0200, Olli Pettay <Olli.Pettay@helsinki.fi>
wrote:
> On 11/26/2010 03:03 AM, chan@info-cast.com wrote:
>>
>>> On 11/23/2010 04:24 AM, chan@info-cast.com wrote:
>>>> Hello Deborah,
>>>>
>>>> OK, if the speech is optional for that "type=speech" element,
>>>> then text (or other modality?) is assumed here ?
>>>> Or the element won't get any input other than speech ?
>>>>
>>>> What we actually need is an element accepting multimodal
>>>> input, assuming both text and speech agents are up and running
>>>> for that element simultaneously. I wonder if this use case
>>>> has been discussed before - my apologies if it has,
>>>> as I started following your standards effort only recently.
>>>
>>>
>>> It is not at all clear that we should add any kind of element.
>>> What I'm probably going to propose is a JS object, which controls
>>> speech recognition and which is the entry point for the
>>> speech-recognition-related DOM event stream.
>>>
>>> Then it is up to the web app (or script libraries) to
>>> handle multimodal integration; this way we don't
>>> limit multimodal input to speech+keyboard, but also mouse/touch
>>> /whatever events can be handled easily.
>>>
>>> -Olli
>>>
>>
>> We don't have to add a new element to support multimodal input;
>> expanding the element's type attribute would be enough.
>> To add speech, the input tag's type attribute may be expanded:
>>
>>    <input type="text|speech" ../>
> 
> That would be a backwards-incompatible change, if you want to
> support any other type than "text". And note, binding
> speech input to one html input element isn't enough
> in the common case when the user wants to fill several fields.
> 
> 
Well, I thought that the Speech HTML assumes a certain degree of
backward compatibility, as the extension may have to introduce new
types and attributes. Since speech is clearly a distinct input
type, it may deserve its own type attribute.

And I think there's a way to deal with this incompatibility, so that
the UA can safely ignore those new extensions. Interestingly enough,
current web browsers don't even care about this 'speech' extension.
I found that they just ignore the extension and still behave correctly,
even without any compatibility check.
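
As a rough illustration of how a page could detect this fallback: an
unrecognized type value is treated as "text", so reading the type
property back tells whether the UA understood it. The helper name
`supportsInputType` is my own, and the document object is passed in
as a parameter only so the sketch is not tied to a browser global.

```javascript
// Hypothetical helper (not part of any proposal): reports whether the UA
// recognizes a given <input> type. UAs fall back to "text" for type values
// they do not know, so the value read back differs from the one that was set.
function supportsInputType(type, doc) {
  const input = doc.createElement("input");
  input.setAttribute("type", type);
  return input.type === type;
}

// In a browser one would call, for example:
//   supportsInputType("speech", document)
//   supportsInputType("text|speech", document)
```

A page could then attach its own speech handling only when the check
fails, and leave the field as a plain text input otherwise.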

>>
>> where users can enter the desired info with either text or speech.
>> And the user must be able to use the speech modality in hands-free
>> mode, as required by R24. This implies both text and speech channels
>> must be active simultaneously to acquire valid input from either of
>> the channels.
>>
>> Implementing and maintaining such dual channel monitoring process
>> could be a complex task for average web developers, even when a JS
>> library object becomes available.
> An average web developer would use a script library for multimodal
> integration.
> 
> 
> 
>> Since this process is well-defined
> Is it really well-defined? I do assume the process depends
> on the web application and the interaction model the
> web application provides.
> 
> 
The process I referred to was the very basic mechanism of:

  1. Sampling utterance
  2. Transmitting to SS
  3. Getting the result back - either recognition pass | fail
  4. Presenting the pass-result on the screen, as if it's typed-in
     (taking specified action for failure case)

Beyond this, any higher-level interpretation is better handled by
the application. The aim is to support basic speech input in the
simplest manner, with a large set of default configuration values.
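
The four steps above can be sketched roughly as follows. This is only
an illustration under my own assumptions: `basicSpeechInput`, the
`recognize` method, the `status`/`text` result fields and the stub
service are all hypothetical names, not an existing or proposed API.

```javascript
// Hypothetical sketch of the basic flow. `speechService` stands in for the
// SS and `field` for the input element; neither is a real browser API.
async function basicSpeechInput(audio, speechService, field, onFail) {
  // Step 1 (sampling) is assumed to have happened already: `audio` is the
  // captured utterance. Step 2: transmit it to the SS.
  const result = await speechService.recognize(audio);
  // Step 3: the result is either a recognition pass or a fail.
  if (result.status === "pass") {
    // Step 4: present the recognized text as if it had been typed in.
    field.value = result.text;
    return true;
  }
  // Failure case: take whatever action was specified for it.
  onFail(result);
  return false;
}

// Stub service for trying the flow without a real recognizer.
const stubService = {
  async recognize(audio) {
    return audio === "hello"
      ? { status: "pass", text: "hello" }
      : { status: "fail" };
  },
};
```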

>> and necessary for all speech-enabled applications, this seems to be more
>> of a UA responsibility than a part of the web application. Developers
>> engaged in speech apps will expect such a function in the speech HTML.
>>
>> This 'speech' attribute can also be used to support the speech input
>> which doesn't have any visible UI counterpart as described in U6.
>> When an element's type is specified with the speech attribute
>> (i.e. type="speech"), then the UA will take action(s) listed in
>> the element upon the SS's matching response for the element, based on
>> the expected utterance (using the value attribute) or a grammar
>> (using new attribute, grammar) for the element:
>>
>>    <input type="speech" value="example" prompt="Say example"../>
>> or,
>>    <input type="speech" grammar="example.grxml" prompt="Say example"../>
>>
>> Another use case is to support voice navigation within web-pages:
>>
>>    <input type="speech" value="home page" onmatch="goToPage('home.html')"../>
>>
>> Again, this can be handled by JS objects, but it'd be much more
>> effective and productive to present such process at the HTML layer
>> along with other ordinary menu based navigation.
> 
> I'm not at all sure about this. Speech modality has different behaviors
> (for example the fill-multiple-fields-at-once) than the visual/keyboard.
> 
> 
> -Olli

Understood. That difference in modality is double-edged: it provides
the power of conveying multiple pieces of information in a single
utterance, but it also adds complexity to recognition/interpretation.
Speech app developers will expect a well-defined framework in the
Speech HTML to deal with this complexity.

For instance, when the SS recognizes a speech-only input type, the
UA just takes the specified action as if it had been entered through
another modality. To avoid ambiguity (i.e. a recognized word or words
could belong to multiple HTML input fields), the grammar must be
pre-compiled and verified by the UA for the given HTML page before it
is sent over to the SS.
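
A minimal sketch of such a verification pass, assuming (as a big
simplification) that each field's grammar is just a list of phrases
rather than a real grammar file. The function name
`findAmbiguousPhrases` and the field names in the usage comment are
hypothetical.

```javascript
// Hypothetical sketch: each field's grammar is reduced to a plain list of
// phrases. Returns every phrase that is accepted by more than one field.
function findAmbiguousPhrases(fieldGrammars) {
  const owners = new Map(); // normalized phrase -> set of field names
  for (const [field, phrases] of Object.entries(fieldGrammars)) {
    for (const phrase of phrases) {
      const key = phrase.trim().toLowerCase();
      if (!owners.has(key)) owners.set(key, new Set());
      owners.get(key).add(field);
    }
  }
  const ambiguous = {};
  for (const [phrase, fields] of owners) {
    if (fields.size > 1) ambiguous[phrase] = [...fields];
  }
  return ambiguous;
}

// Example: "Boston" is valid for both fields, so it is reported.
//   findAmbiguousPhrases({ from: ["Boston", "Denver"], to: ["Boston", "Austin"] })
```

The UA could refuse to send the page's grammars to the SS, or ask for
disambiguating annotations, whenever the returned object is not empty.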

As for supporting multiple input fields, developers may find it
useful if the UA allows grouping related input fields in a <table>
tag (with some attribute extension), with the UA itself merging the
individual grammars of the separate fields to produce a
'table-grammar' that recognizes an MI (Mixed Initiative) mode utterance.
Here, the grammar needs to be annotated to bind each interpretation
to one of the fields. The 'table-grammar' must also support recognizing
an utterance for each individual field. It'd be nice to have the UA
support this kind of service in v1.
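
To make the 'table-grammar' idea a bit more concrete, here is a toy
sketch. It again assumes each field's grammar is a plain phrase list
(lowercase letters, digits and single spaces only) and that the
phrases have already been checked to be unambiguous across fields.
`buildTableGrammar`, `interpret` and the field names are hypothetical.

```javascript
// Hypothetical sketch: merge per-field phrase lists into one annotated
// "table-grammar", where every entry remembers the field it belongs to.
function buildTableGrammar(fieldGrammars) {
  const entries = [];
  for (const [field, phrases] of Object.entries(fieldGrammars)) {
    for (const phrase of phrases) {
      entries.push({ field, phrase: phrase.toLowerCase() });
    }
  }
  // Try longer phrases first so a field's more specific phrase wins.
  entries.sort((a, b) => b.phrase.length - a.phrase.length);
  return entries;
}

// Bind each phrase found in the utterance to its field. Works for a
// mixed-initiative utterance as well as for a single-field utterance.
function interpret(tableGrammar, utterance) {
  const words = utterance.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
  const text = ` ${words} `;
  const bindings = {};
  for (const { field, phrase } of tableGrammar) {
    if (!(field in bindings) && text.includes(` ${phrase} `)) {
      bindings[field] = phrase;
    }
  }
  return bindings;
}

// Example:
//   const g = buildTableGrammar({ size: ["small", "large"],
//                                 topping: ["cheese", "pepperoni"] });
//   interpret(g, "A large pepperoni, please"); // fills both fields
//   interpret(g, "cheese");                    // fills only `topping`
```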

Chan
Received on Monday, 6 December 2010 02:15:32 UTC