Re: An early draft of a speech API

On 03/14/2011 08:41 PM, Robert Brown wrote:
> Olli, I know you were busy and weren't able to put as much time into
> this as you wanted to.  But the IDL is mostly self-explanatory.
>
> Here are my thoughts:
>
> I think the factoring of "RecognitionResult[] results" isn't quite
> right.  Each element in the array has its own copy of the full EMMA
> document.  The EMMA document should stand aside from the array.
Very true :)
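Something like the following sketch (hypothetical names, not the draft's
actual IDL) is probably what we want: the EMMA document sits once on the
response object instead of being copied into every result.

  // Hypothetical shape, for illustration only; names are not from the draft.
  interface RecognitionResult {
    utterance: string;    // recognized text for this hypothesis
    confidence: number;   // 0.0 .. 1.0
  }

  interface RecognitionResponse {
    results: RecognitionResult[];  // n-best list, no per-result EMMA copy
    emmaXML: Document;             // the single EMMA document for the response
  }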


>
> Binding to DOMNode is too generic.  Realistically, there are only
> certain types such as <input> and <textarea> where it would make
> sense to do this.
Really? What about contentEditable? designMode?
What about selecting something in <select> using speech?
What about page-level speech input?
The main reason for BoundElement is that the UA can use it to
show the user a hint about which part of the page wants the
speech input. In some cases the input is for the whole page (navigation,
for example); in other cases it is field level, form level, or something else.
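To make the intent concrete, here is a rough usage sketch (the
SpeechRequest constructor and its argument are assumptions, not the
draft's exact API): the bound node can be a form field, an editable
region, or the whole document.

  // Hypothetical API surface, for illustration only.
  declare class SpeechRequest {
    constructor(boundElement?: Node);
    start(): void;
  }

  // Field level: the UA can hint that this text field wants speech input.
  const field = document.querySelector("input[name=q]");
  if (field) new SpeechRequest(field).start();

  // Editable region: contentEditable / designMode areas are plausible targets too.
  const editor = document.querySelector("[contenteditable]");
  if (editor) new SpeechRequest(editor).start();

  // Page level: no specific field, e.g. voice navigation for the whole page.
  new SpeechRequest(document.documentElement).start();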


> And even then, the semantics of binding to these
> specific elements aren't obvious.  Do the values of type and pattern
> imply a grammar?   If so, what's the algorithm for generating that
> grammar?  How does this interact with any SRGS grammar the developer
> also supplied? How does it interact with <label>?  How is it supposed
> to work intuitively for radio buttons and check boxes given the
> extremely loose syntax for including those in a group?  Etc.  This
> looks really nice when written in IDL, but the actual spec will be
> very hard to write if we want it to be intuitive to the developer.
> (We like the idea, but the answers to these questions weren't obvious
> to us, which is one of the reasons we made it a third priority in our
> proposal)

Yeah, the internal grammars would be hard to define. The initial
implementation would probably just use external grammar files, and
perhaps allow dictation.
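As a sketch of that "external grammar first" approach (the property names
here are guesses, not the draft's): the author points the request at an
SRGS file, and can optionally fall back to dictation.

  // Hypothetical API surface, for illustration only.
  declare class SpeechRequest {
    grammars: string[];   // URIs of external SRGS grammar files (assumed name)
    dictation: boolean;   // allow free-form dictation instead of a grammar
    start(): void;
  }

  const req = new SpeechRequest();
  req.grammars = ["grammars/pizza-order.grxml"];  // example URI, not a real resource
  req.dictation = false;
  req.start();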


>
> The start & stop semantics combine microphone control with recognizer
> control.  It would be more flexible to separate these.  Or if you
> really want to bundle these concepts together, try both a method
> called "stoplistening()" and one called "stoprecognizing()" or
> whatever.  Maybe if there wasn't already microphone work being done,
> conflating the two ideas would make sense.   But as things stand, if
> we conflated them now, we'd just have to unravel them in the near
> future (or have two parallel microphone APIs, with their own consent
> UX's, etc, <yuk>).
Well, that is mainly a UA problem. And it is still a bit unclear what the
microphone API will look like.
But sure, this all depends on whether some separate microphone API
will be used. By default the speech API shouldn't require a microphone API,
I think, and the implementation should handle opening and closing the
microphone automatically.
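In other words, something like this sketch (again with hypothetical
names): start() implicitly opens the microphone, after whatever consent
UI the UA shows, and stop() releases it, so the page never handles a
microphone object directly.

  // Hypothetical API surface, for illustration only.
  declare class SpeechRequest {
    onresult: ((response: unknown) => void) | null;
    start(): void;  // UA opens the microphone and starts recognition
    stop(): void;   // UA stops recognition and closes the microphone
  }

  const req = new SpeechRequest();
  req.onresult = (response) => console.log("result:", response);
  req.start();
  // ... later, for example when the user clicks a "done" button:
  req.stop();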



>
> Not sure if one request at a time is the right approach.
That is just the default case. The API allows several requests at a
time. There is even a simple example that uses a field-level and a
page-level request at the same time.
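For instance, roughly like this (hypothetical names and grammar URI): a
page-level request stays active for navigation commands while a
field-level request handles input into one text box, both live at once.

  // Hypothetical API surface, for illustration only.
  declare class SpeechRequest {
    constructor(boundElement?: Node);
    grammars: string[];
    onresult: ((response: unknown) => void) | null;
    start(): void;
    stop(): void;
  }

  // Page-level request: always listening for navigation commands.
  const nav = new SpeechRequest(document.documentElement);
  nav.grammars = ["grammars/navigation.grxml"];  // example URI
  nav.onresult = (r) => console.log("navigation command:", r);
  nav.start();

  // Field-level request: bound to one text box, active at the same time.
  const box = document.querySelector("input[name=q]");
  if (box) {
    const dictation = new SpeechRequest(box);
    dictation.onresult = (r) => console.log("dictation for search box:", r);
    dictation.start();
  }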



-Olli


> For
> example: you could have one request listening for navigation commands
> all the time, while another just listens for a specific grammar in a
> certain context.  Or: you could have a local recognizer listening for
> simple commands ("call Fred"), and a cloud-service listening with a
> huge SLM ("what's the supermarket closest to Fred"), both at the same
> time.  Besides, two different pages displayed side-by-side could
> certainly reco at the same time.  So why not two different things in
> the same page?  Also, the logic you're describing for automatically
> deactivating and reactivating seems very biased toward a certain
> scenario.  In my opinion, it's much better to let the developer
> decide when to activate and deactivate.
>
>
> -----Original Message----- From: public-xg-htmlspeech-request@w3.org
> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Olli
> Pettay Sent: Monday, February 28, 2011 12:38 PM To:
> public-xg-htmlspeech@w3.org Subject: An early draft of a speech API
>
> Hi all,
>
> here is what I had in mind for the speech API. The text still lacks lots
> of definitions, but I hope it is still somewhat clear how it should
> work. (Getting Firefox 4 done has taken most of my time.)
>
> The main difference from Google's API is that this isn't based on
> elements, but on request objects.
>
> For TTS we could probably use something close to what Björn just
> proposed.
>
>
>
> -Olli
>
>

Received on Monday, 14 March 2011 19:48:35 UTC