Re: speech API proposal (from Microsoft) from Olli Pettay on 2011-03-08 (public-xg-htmlspeech@w3.org from March 2011)

From: Olli Pettay <Olli.Pettay@helsinki.fi>
Date: Tue, 08 Mar 2011 22:03:09 +0200
To: Robert Brown <Robert.Brown@microsoft.com>
CC: "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
Message-ID: <4D768B7D.60605@helsinki.fi>

On 03/01/2011 07:11 PM, Robert Brown wrote:
> Hi Everyone,
>
> Our proposal is posted here:
> http://lists.w3.org/Archives/Public/www-archive/2011Mar/att-0001/microsoft-api-draft-final.html
>
>
> It proposes some extensions to existing APIs, as well as some
> speech-specific objects, and some speech-specific HTML.
>
> Cheers,
>
> /Rob
>

Some comments.

Chapter 6.
Have you investigated whether HTML <device> could be used instead of the
Capture API? There is a possible practical/political problem with using
the Capture API: it is a draft from the DAP WG, which the major browser
vendors, except Opera, have left. (I don't recall all the reasons why
that happened last year.)

It might be more flexible to use WebSockets than XHR.
I doubt send(in Stream) will ever be accepted into XHR.
XHR is becoming a rather complicated API even without streaming,
and WebSocket is all about streaming.
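
To make the streaming point concrete, here is a minimal sketch in
TypeScript of sending captured audio chunk by chunk rather than as one
upload. ChunkSink and streamAudio are hypothetical names, not part of
any proposal; the sink is assumed to be something like a thin wrapper
around a WebSocket's send(), which accepts binary data.

```typescript
// Hypothetical sketch: none of these names come from the proposal.
// A sink is anything that can take one binary message at a time,
// e.g. a small wrapper around WebSocket.send().
interface ChunkSink {
  send(chunk: Uint8Array): void;
}

// Split captured audio into fixed-size chunks and send each one as its own
// message, instead of buffering the whole utterance for a single upload.
// Returns the number of messages sent.
function streamAudio(samples: Uint8Array, chunkSize: number, sink: ChunkSink): number {
  if (chunkSize <= 0) {
    throw new RangeError("chunkSize must be positive");
  }
  let sent = 0;
  for (let offset = 0; offset < samples.length; offset += chunkSize) {
    // slice() copies, so each message owns its own bytes.
    sink.send(samples.slice(offset, offset + chunkSize));
    sent += 1;
  }
  return sent;
}
```

The recognizer could then start working on the first chunks while the
user is still speaking, which is hard to express with a single XHR
send().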

The proposed change to XHR+multipart isn't what is implemented today
in some browsers.

In general, I'd prefer a solution that is simpler (for web developers)
than Capture API + XHR. Also, this approach would require major changes
to other APIs, and it doesn't allow local speech engines.

Chapter 7.
I like GrammarCollection. (Minor nit: for consistency with other web
APIs, it should have length, not count.)
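
As a rough TypeScript sketch of why length matters: code written for
NodeList-style collections (length plus item()) then works unchanged.
The members shown here (add, item, string URIs) are assumptions for
illustration; the proposal's actual GrammarCollection may differ.

```typescript
// Hypothetical shape; the proposal's GrammarCollection members may differ.
class GrammarCollection {
  private readonly grammars: string[] = [];

  // "length" rather than "count", matching NodeList, HTMLCollection and Array.
  get length(): number {
    return this.grammars.length;
  }

  // Returns null when the index is out of range, as NodeList.item() does.
  item(index: number): string | null {
    return this.grammars[index] ?? null;
  }

  add(grammarUri: string): void {
    this.grammars.push(grammarUri);
  }
}

// Generic helper for anything exposing length and item(),
// including the collection above.
function listAll(collection: { length: number; item(i: number): string | null }): string[] {
  const out: string[] = [];
  for (let i = 0; i < collection.length; i++) {
    const grammar = collection.item(i);
    if (grammar !== null) {
      out.push(grammar);
    }
  }
  return out;
}
```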

Nit, SetInputDevice -> setInputDevice

Is it possible to use SpeechRecognizer without the Capture API?

Why is the recognizer set in the SpeechRecognizer constructor, while
the capture device needs a separate method?

Nit: event listener attributes should be "attribute Function onfoo",
not "onfoo()".
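
The difference, roughly sketched in TypeScript: a handler attribute is
a writable property that the page assigns a function to, not a method
that the page calls. RecognizerSketch, onresult and fire() are made-up
names for illustration only.

```typescript
// Hypothetical sketch; "RecognizerSketch", "onresult" and "fire" are made up.
type Handler = ((event: { type: string }) => void) | null;

class RecognizerSketch {
  // A handler *attribute*: a writable property holding a function or null,
  // in the style of XMLHttpRequest's onload. Pages assign to it.
  onresult: Handler = null;

  // The implementation, not the page, invokes the handler when the event
  // fires. Returns whether a handler was present and called.
  fire(type: string): boolean {
    if (this.onresult === null) {
      return false;
    }
    this.onresult({ type });
    return true;
  }
}
```

Usage is then "reco.onresult = function (e) { ... };" rather than
defining or calling an onresult() method.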

Not very surprisingly, in general I like the SpeechRecognizer approach.
The use of XHR could be removed, and the Capture API and remote speech
services could be supported in a v2 (assuming the Capture API is even
close to stable).

Chapter 7.4
I think I prefer Björn's approach of reusing <audio> for TTS, especially
because SpeechSynthesizer looks a lot like the API for <audio>.
Or alternatively <tts> from 8.2.

Chapter 8
Assuming <reco> doesn't have any visual representation, I think
I could prefer that approach, especially if the API is simplified a bit
(maybe remove .capture and SetSpeechService from v1).
In a way it is very close to SpeechRequest, which has aBoundElement as a
parameter to its constructor. <reco>'s "bound element" would be the
parent element. There is a parsing problem, though: IIRC the HTML5
parser never creates a child element for an <input> element. So the
'for' attribute approach would work better.

Chapter 8.2
HTMLTTSElement could perhaps extend HTMLAudioElement, or at least 
HTMLMediaElement.

So far, SpeechRequest/<reco> for recognition and
<tts> (either MS's or Google's) for TTS look most promising
to me.

I haven't yet read the Tropo document properly, but it doesn't
"feel" very webby, and it uses terms like Event differently from
web specs. But I'll comment on it more later.

-Olli

Received on Tuesday, 8 March 2011 20:03:43 UTC