Re: R23. Speech as an input on any application should be able to be optional from Olli Pettay on 2010-12-09 (public-xg-htmlspeech@w3.org from December 2010)

From: Olli Pettay <Olli.Pettay@helsinki.fi>
Date: Thu, 09 Dec 2010 13:49:24 +0200
To: chan@info-cast.com
CC: Deborah Dahl <dahl@conversational-technologies.com>, Bjorn Bringert <bringert@google.com>, Dan Burnett <dburnett@voxeo.com>, public-xg-htmlspeech@w3.org
Message-ID: <4D00C244.4050101@helsinki.fi>
On 12/06/2010 04:15 AM, chan@info-cast.com wrote:
>>>> >>>  On 11/23/2010 04:24 AM,chan@info-cast.com  wrote:
>>>>> >>>>  Hello Deborah,
>>>>> >>>>
>>>>> >>>>  OK, if the speech is optional for that "type=speech" element,
>>>>> >>>>  then text (or other modality?) is assumed here ?
>>>>> >>>>  Or the element won't get any input other than speech ?
>>>>> >>>>
>>>>> >>>>  What we actually need is an element accepting multimodal
>>>>> >>>>  input, assuming both text and speech agents up and running
>>>>> >>>>  for that element simultaneously. Wonder if this use case
>>>>> >>>>  had been discussed before - my apology if it's been,
>>>>> >>>>  as I started following your standard efforts quite lately.
>>>> >>>
>>>> >>>
>>>> >>>  It is not at all clear that we should add any kind of element.
>>>> >>>  What I'm probably going to propose is a JS object, which controls
>>>> >>>  speech recognition and which is the entry point for
>>>> >>>  speech recognition related DOM event stream.
>>>> >>>
>>>> >>>  Then it is up to the web app (or script libraries) to
>>>> >>>  handle multimodal integration; this way we don't
>>>> >>>  limit multimodal input to speech+keyboard, but also mouse/touch
>>>> >>>  /whatever events can be handled easily.
>>>> >>>
>>>> >>>  -Olli
>>>> >>>
>>> >>
>>> >>  Don't have to add a new element to support multimodality elements,
>>> >>  but expanding the element's type attribute would be enough.
>>> >>  To add speech, the input tag's type attribute may be expanded:
>>> >>
>>> >>      <input type="text|speech" ../>
>> >
>> >  That would be a backwards-incompatible change, if you want to
>> >  support any other type than "text". And note, binding
>> >  speech input to one html input element isn't enough
>> >  in the common case when user wants to fill several fields.
>> >
>> >
> Well, I thought that the Speech HTML assumes a certain degree of
> backward compatibility as the extension may have to introduce new
> types and attributes. Since the speech is clearly a distinct input
> type, it may deserve its own type attribute.

Or perhaps it is better to have a new element or a new JS object
and then, using some very simple scripting, copy the recognition result
to the relevant input element.
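
To show how little scripting that would take, here is a minimal sketch.
The names createRecognizer, onresult and simulateResult are made up for
illustration; no such API has been proposed yet. The cityField object is
a stand-in for an input element, so the snippet runs outside a browser.

```javascript
// Hypothetical recognizer object: not a real or proposed API.
function createRecognizer() {
  const listeners = [];
  return {
    // Register a callback that receives recognition results.
    onresult(callback) {
      listeners.push(callback);
    },
    // Stand-in for the speech service delivering a result.
    simulateResult(text) {
      listeners.forEach((callback) => callback({ text }));
    },
  };
}

// Stand-in for an <input> element; in a page this would come from the DOM.
const cityField = { name: "city", value: "" };

const recognizer = createRecognizer();
// The "very simple scripting": copy the result into the relevant field.
recognizer.onresult((result) => {
  cityField.value = result.text;
});

recognizer.simulateResult("Helsinki");
console.log(cityField.value);
```

The web app decides where the result goes, so the same object could feed
any input type without changes to the type attribute.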


>
> And I think there's a way to deal with this incompatibility, so that
> the UA can safely ignore those new extensions. Interestingly enough,
> the current web browsers don't even care about this 'speech' extension.
> I found that they just ignore the extension to behave correctly even
> w/o any compatibility check.

If you tested "text|speech", then yes, per HTML5 that is interpreted as
"text". But the input element has many other types, and those would
also need to support speech. So, I believe, it is easier to have some
speech-specific element or JS object from which the recognition
result can be taken. That way fill-multiple-fields-at-once
can be handled the same way as a single field.
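
A rough sketch of what I mean, assuming the recognition result arrives
as an object with named slots. That result shape, the fillFields helper
and the field names are all assumptions made for this example; the
result format has not been defined. The bookingFields objects stand in
for input elements.

```javascript
// Copy every slot of a recognition result into the field with the same name.
// The slot-based result shape is an assumption made for this sketch.
function fillFields(fields, slots) {
  const filled = [];
  for (const [name, value] of Object.entries(slots)) {
    if (Object.prototype.hasOwnProperty.call(fields, name)) {
      fields[name].value = value;
      filled.push(name);
    }
  }
  return filled;
}

// Stand-ins for input elements of different types.
const bookingFields = {
  from: { type: "text", value: "" },
  to: { type: "text", value: "" },
  date: { type: "date", value: "" },
};

// One utterance, one field.
fillFields(bookingFields, { from: "Helsinki" });
// One utterance, several fields: same code path.
fillFields(bookingFields, { to: "Boston", date: "2010-12-24" });
console.log(bookingFields);
```

The single-field case is just a result with one slot, so no special
handling per input type is needed.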


>
>>> >>
>>> >>  where users can enter the desired info with either text or speech.
>>> >>  And the user must be able to use the speech modality in hands-free
>>> >>  mode, as required by R24. This implies both text and speech channels
>>> >>  must be active simultaneously to acquire valid input from either of
>>> >>  the channels.
>>> >>
>>> >>  Implementing and maintaining such dual channel monitoring process
>>> >>  could be a complex task for average web developers, even when a JS
>>> >>  library object becomes available.
>> >  Average web developer would use a script library for multimodal
>> >  integration.
>> >
>> >
>> >
>>> >>  Since this process is well-defined
>> >  Is it really well-defined? I do assume the process depends
>> >  on the web application and the interaction model the
>> >  web application provides.
>> >
>> >
> The process I referred to was the very basic mechanism of:
>
>    1. Sampling utterance
>    2. Transmitting to SS
>    3. Getting the result back - either recognition pass | fail
>    4. Presenting the pass-result on the screen, as if it's typed-in
>       (taking specified action for failure case)

Step 4 is just one use case. There are plenty of other cases where
the result doesn't need to be presented on the screen
("zoom in", "pan right"), or it needs to be presented in several
places (fill-multiple-fields-at-once).
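
For the command case, a sketch could look like this. The command table,
the dispatchCommand helper and the mapState object are hypothetical and
only illustrate that a recognition result can drive an action without
ever being shown as typed-in text.

```javascript
// State that voice commands act on; nothing is written into a form field.
const mapState = { zoom: 1, x: 0 };

// Hypothetical command table: recognized utterance -> action.
const commandTable = new Map([
  ["zoom in", (state) => { state.zoom += 1; }],
  ["pan right", (state) => { state.x += 10; }],
]);

// Run the action for a recognized utterance; return false if none matches.
function dispatchCommand(utterance, state) {
  const action = commandTable.get(utterance.trim().toLowerCase());
  if (action === undefined) {
    return false;
  }
  action(state);
  return true;
}

dispatchCommand("Zoom in", mapState);
dispatchCommand("pan right", mapState);
console.log(mapState);
```

What happens on an unmatched utterance is again up to the web app; here
the helper only reports that nothing matched.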


>
> Other than this, higher-level interpretation may better be
> dealt with by the application. This is to support basic speech input in
> the simplest manner, with a large set of default configuration.
>
>>> >>  and necessary for all speech-enabled application, this seems to be more
>>> >>  of the UA responsibility, than a part of web application. Developers
>>> >>  engaged in speech apps will expect such function in the speech HTML.
>>> >>
>>> >>  This 'speech' attribute can also be used to support the speech input
>>> >>  which doesn't have any visible UI counterpart as described in U6.
>>> >>  When an element's type is specified with the speech attribute
>>> >>  (i.e. type="speech"), then the UA will take action(s) listed in
>>> >>  the element upon SS's matching response for the element, based on
>>> >>  the expected utterance (using the value attribute) or a grammar
>>> >>  (using new attribute, grammar) for the element:
>>> >>
>>> >>      <input type="speech" value="example" prompt="Say example"../>
>>> >>  or,
>>> >>      <input type="speech" grammar="example.grxml" prompt="Say example"../>
>>> >>
>>> >>  Another use case is to support voice navigation within web-pages:
>>> >>
>>> >>      <input type="speech" value="home page" onmatch="goToPage('home.html')"../>
>>> >>
>>> >>  Again, this can be handled by JS objects, but it'd be much more
>>> >>  effective and productive to present such process at the HTML layer
>>> >>  along with other ordinary menu based navigation.
>> >
>> >  I'm not at all sure about this. Speech modality has different behaviors
>> >  (for example the fill-multiple-fields-at-once) than the visual/keyboard.
>> >
>> >
>> >  -Olli
> Understood. That different modality is double-edged as it provides
> the power of presenting multiple information in a single utterance,
> but then it creates the complexity of recognition/interpretation.
> Speech app developers will expect a well-defined framework in the
> Speech HTML to deal with this complexity.
>
> For instance, when the SS recognizes a speech-only input type, then
> the UA just takes the action specified as if it's entered from other
> modality. In order to avoid ambiguity (i.e. a recognized word(s) could
> belong to multiple HTML input fields), the grammar must be pre-compiled
> and verified by the UA for given HTML page before sending over to the SS.
Why does the grammar need to be pre-compiled and/or verified?
The web app just gets a recognition result back, and then it is up to the
web app to decide how to use that result.


>
> As far as supporting multiple input fields, the developer may find it
> useful if the UA allows grouping the related input fields into a <table>
> tag (with some attribute extension), and then the UA itself merges the
> individual grammars from the each separate fields to produce a
It would be very strange to force web app authors to use <table>
whenever there are multiple input fields.


-Olli
Received on Thursday, 9 December 2010 11:50:28 UTC