RE: An early draft of a speech API from Robert Brown on 2011-03-14 (public-xg-htmlspeech@w3.org from March 2011)

From: Robert Brown <Robert.Brown@microsoft.com>
Date: Mon, 14 Mar 2011 18:41:41 +0000
To: "Olli@pettay.fi" <Olli@pettay.fi>, "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
Message-ID: <113BCF28740AF44989BE7D3F84AE18DD198751F2@TK5EX14MBXC118.redmond.corp.microsoft.>

Olli, I know you were busy and weren't able to put as much time into this as you wanted to.  But the IDL is mostly self-explanatory. 

Here are my thoughts:

I think the factoring of "RecognitionResult[] results" isn't quite right.  Each element in the array has its own copy of the full EMMA document.  The EMMA document should stand aside from the array.
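
As a rough illustration of this factoring point, the sketch below keeps one EMMA document beside the array instead of one copy per element. All type, field and function names here (PerElementResult, ResultSet, hoistEmma, utterance, confidence) are hypothetical and are not taken from the draft IDL; the EMMA string is only a placeholder.

```typescript
// Hypothetical shape where every array element carries the full EMMA document.
interface PerElementResult {
  utterance: string;
  confidence: number;
  emma: string; // full EMMA document, repeated on each element
}

// Hypothetical alternative: the EMMA document stands beside the array.
interface Alternative {
  utterance: string;
  confidence: number;
}

interface ResultSet {
  emma: string; // one EMMA document for the whole recognition
  results: Alternative[];
}

// Convert the per-element shape into the hoisted shape.
// Assumes every element carries the same document, so the first copy is used.
function hoistEmma(perElement: PerElementResult[]): ResultSet {
  const emma = perElement[0]?.emma ?? "";
  const results = perElement.map(({ utterance, confidence }) => ({ utterance, confidence }));
  return { emma, results };
}

const exampleSet = hoistEmma([
  { utterance: "call Fred", confidence: 0.9, emma: "<emma:emma/>" },
  { utterance: "call Fran", confidence: 0.4, emma: "<emma:emma/>" },
]);
console.log(exampleSet.results.length, exampleSet.emma);
```

With this shape the n-best list stays lightweight, and a consumer that wants the EMMA document reads it once from the enclosing object.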

Binding to DOMNode is too generic.  Realistically, there are only certain element types, such as <input> and <textarea>, where it would make sense to do this.  And even then, the semantics of binding to these specific elements aren't obvious.  Do the values of the type and pattern attributes imply a grammar?  If so, what's the algorithm for generating that grammar?  How does this interact with any SRGS grammar the developer also supplied?  How does it interact with <label>?  How is it supposed to work intuitively for radio buttons and check boxes, given the extremely loose syntax for including those in a group?  Etc.  This looks really nice when written in IDL, but the actual spec will be very hard to write if we want it to be intuitive to the developer.  (We like the idea, but the answers to these questions weren't obvious to us, which is one of the reasons we made it a third priority in our proposal.)

The start and stop semantics combine microphone control with recognizer control.  It would be more flexible to separate these.  Or, if you really want to bundle these concepts together, provide both a method called "stoplistening()" and one called "stoprecognizing()", or whatever.  If there weren't already microphone work being done, conflating the two ideas might make sense.  But as things stand, if we conflated them now, we'd just have to unravel them in the near future (or have two parallel microphone APIs, each with its own consent UX, etc., <yuk>).
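
A minimal sketch of the separation suggested here, assuming two hypothetical objects: one that owns microphone capture and one that owns recognition. The class and method names (MicrophoneCapture, Recognizer, startListening, startRecognizing and so on) are invented for illustration and belong to neither proposal; no real audio is captured.

```typescript
// Hypothetical microphone object: owns capture (and, in a real API, consent).
class MicrophoneCapture {
  private listening = false;
  startListening(): void { this.listening = true; }
  stopListening(): void { this.listening = false; }
  isListening(): boolean { return this.listening; }
}

// Hypothetical recognizer: consumes audio from a microphone it does not control.
class Recognizer {
  private recognizing = false;
  private readonly mic: MicrophoneCapture;

  constructor(mic: MicrophoneCapture) {
    this.mic = mic;
  }

  startRecognizing(): void {
    if (!this.mic.isListening()) {
      throw new Error("microphone is not capturing");
    }
    this.recognizing = true;
  }

  // Stops recognition only; the microphone keeps capturing.
  stopRecognizing(): void { this.recognizing = false; }
  isRecognizing(): boolean { return this.recognizing; }
}

const exampleMic = new MicrophoneCapture();
const exampleReco = new Recognizer(exampleMic);
exampleMic.startListening();
exampleReco.startRecognizing();
exampleReco.stopRecognizing();
console.log(exampleMic.isListening(), exampleReco.isRecognizing());
```

Because the recognizer only borrows the microphone, a separate microphone API could supply the capture object without a second consent path.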

Not sure if one request at a time is the right approach.  For example: you could have one request listening for navigation commands all the time, while another just listens for a specific grammar in a certain context.  Or: you could have a local recognizer listening for simple commands ("call Fred"), and a cloud-service listening with a huge SLM ("what's the supermarket closest to Fred"), both at the same time.  Besides, two different pages displayed side-by-side could certainly reco at the same time.  So why not two different things in the same page?  Also, the logic you're describing for automatically deactivating and reactivating seems very biased toward a certain scenario.  In my opinion, it's much better to let the developer decide when to activate and deactivate.
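
The concurrent-request scenario can be sketched as follows, under loud assumptions: RecognitionRequest and dispatch are hypothetical names, a "grammar" is reduced to a plain predicate over an already-transcribed utterance, and activation is left entirely to the developer. This is a toy model of the idea, not an implementation of either proposal.

```typescript
// Hypothetical request object; the developer decides when it is active.
class RecognitionRequest {
  readonly name: string;
  private readonly accepts: (utterance: string) => boolean;
  private active = false;

  constructor(name: string, accepts: (utterance: string) => boolean) {
    this.name = name;
    this.accepts = accepts;
  }

  activate(): void { this.active = true; }
  deactivate(): void { this.active = false; }

  // An inactive request never matches.
  tryMatch(utterance: string): boolean {
    return this.active && this.accepts(utterance);
  }
}

// Offer one utterance to every request; return the names of those that matched.
function dispatch(utterance: string, requests: RecognitionRequest[]): string[] {
  return requests.filter((r) => r.tryMatch(utterance)).map((r) => r.name);
}

// Navigation commands listen all the time; the dialer listens only in context.
const exampleNavigation = new RecognitionRequest("navigation", (u) =>
  ["go back", "go home"].includes(u));
const exampleDialer = new RecognitionRequest("dialer", (u) => u.startsWith("call "));
exampleNavigation.activate();
exampleDialer.activate();
console.log(dispatch("call Fred", [exampleNavigation, exampleDialer]));
exampleDialer.deactivate(); // developer leaves the dialing context
console.log(dispatch("call Fred", [exampleNavigation, exampleDialer]));
```

Nothing in the sketch deactivates a request automatically, which mirrors the preference for explicit developer control.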


-----Original Message-----
From: public-xg-htmlspeech-request@w3.org [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Olli Pettay
Sent: Monday, February 28, 2011 12:38 PM
To: public-xg-htmlspeech@w3.org
Subject: An early draft of a speech API

Hi all,

here is what I had in mind for a speech API.
The text still lacks many definitions, but I hope it is reasonably clear how it should work.
(Getting Firefox 4 done has taken most of my time.)

The main difference from Google's API is that this isn't based on elements, but on request objects.

For TTS we could probably use something close to what Björn just proposed.



-Olli

Received on Monday, 14 March 2011 18:42:17 UTC