
RE: speech API proposal (from Microsoft)

From: Robert Brown <Robert.Brown@microsoft.com>
Date: Fri, 11 Mar 2011 20:57:09 +0000
To: "Young, Milan" <Milan.Young@nuance.com>, "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
Message-ID: <113BCF28740AF44989BE7D3F84AE18DD19873FA6@TK5EX14MBXC118.redmond.corp.microsoft.com>
[Also replying to Olli's comments http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Mar/0007.html to consolidate the thread]

Thanks to both of you for your feedback.

Section 6.1:

<<Olli>> Have you investigated if HTML <device> could be used, instead of Capture API?

We think the <device> API has too much ambiguity and needs a lot more work.  Capture looks like the better path today if we want interoperable implementations.

<<Milan>> But wouldn't it be more forward-looking to use the speech-aware variants for the capture API as well?

If you mean we should use consistent naming, then yes.

<<Milan>> I didn't quite understand where the endpointing would take place.  Is this implemented by the UA?

Yes, we're proposing that at a minimum the UA be capable of endpointing, which the developer can optionally take advantage of.  This maximizes flexibility, considering the broad space of potential network and service options and constraints that applications will be built for.  Some recognition services will also endpoint, and developers can rely on that if they like.

Section 6.2:

<<Milan>> I'm interested to know your reasoning for choosing XHR2 over WebSockets.
<<Olli>> It might be more flexible to use WebSockets than XHR.

Why XHR2?  It's a hugely successful, easy, and familiar API, with an underlying architecture that works well for speech right now.  It's very easy to program against, both on the client and the server.  If a speech technology vendor wanted to support it, the work is minimal.  And you saw from the code samples that it's really straightforward to script from the app too.  In short: it's practical, it works, and it's completely in line with the way web developers program now.  Why exclude it?
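To make that concrete, here's a rough sketch of what an app might do with XHR2; the service URL, content type, and response handling below are illustrative assumptions on my part, not part of our proposal:

```javascript
// Hedged sketch: posting captured audio to a hypothetical recognition
// service over XHR2. The URL, header, and response shape are invented
// here purely for illustration.
function buildRecognizeRequest(serviceUrl, audioBlob, lang) {
  // Pure helper so the request parameters are easy to inspect.
  return {
    method: "POST",
    url: serviceUrl + "?lang=" + encodeURIComponent(lang),
    headers: { "Content-Type": "audio/x-wav" },
    body: audioBlob,
  };
}

function recognize(serviceUrl, audioBlob, lang, onResult) {
  var req = buildRecognizeRequest(serviceUrl, audioBlob, lang);
  var xhr = new XMLHttpRequest();
  xhr.open(req.method, req.url, true);
  xhr.setRequestHeader("Content-Type", req.headers["Content-Type"]);
  xhr.onload = function () {
    // Assume the service answers with an EMMA document; hand it to the app.
    onResult(xhr.responseXML);
  };
  xhr.send(req.body); // send(Blob) is already in XHR2
}
```

The point is how little ceremony is involved: build a request, send the blob, read the result.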

Why not WebSockets?  We're not opposed to this.  It's a promising concept, and the teething pains around security will get sorted out eventually.  We just don't have a proposal.  One of you should write one :-) I'm serious.  You both seem to have a deep hunch you want to follow, but it's hard to discuss something non-concrete.
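Since it's hard to discuss something non-concrete, here's the flavor of thing such a proposal would need to pin down; the chunk framing below is entirely invented for illustration:

```javascript
// Hedged sketch: a WebSocket-based proposal would have to define how audio
// chunks are framed on the wire. This 4-byte big-endian sequence-number
// prefix is an invented example, not anything we've proposed.
function frameAudioChunk(seq, chunk) {
  var header = new Uint8Array(4);
  header[0] = (seq >>> 24) & 0xff;
  header[1] = (seq >>> 16) & 0xff;
  header[2] = (seq >>> 8) & 0xff;
  header[3] = seq & 0xff;
  var framed = new Uint8Array(4 + chunk.length);
  framed.set(header, 0);   // sequence number lets the service detect gaps
  framed.set(chunk, 4);    // then the raw audio bytes
  return framed;
}

// A client would then do something like:
//   var ws = new WebSocket(serviceUrl);
//   ws.binaryType = "arraybuffer";
//   ws.send(frameAudioChunk(seq++, chunk));
```

A real proposal would also need to specify the subprotocol, result delivery, and error handling, which is exactly why we'd like to see one written down.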

<<Olli>> I doubt send(in Stream) will ever be accepted into XHR.

We disagree on this one.  XHR already has send(Blob), and it already streams over HTTP.  Seems like a no-brainer to connect the dots on this.  Tell us what we're missing.

<<Olli>> The proposed change to XHR+multipart isn't what is implemented today in some browsers.

Sounds like you have a suitable existing spec in mind.  Can you provide a link?  We care more about the principle of multipart than the specific API.

<<Olli>> The approach doesn't allow local speech engines.

True.  That's chapter 7.

Section 6.4

<<Milan>> What do you think about using the MRCP RECOGNIZE method as the base format instead of inventing something so similar?

Certainly worth exploring.

Section 7

<<Olli>> Various nits.

You're good at proofreading code :-).  No push back on these - they make sense.

<<Olli>> I like GrammarCollection.
<<Olli>> Not very surprisingly, in general I like the SpeechRecognizer approach.

Thanks.

<<Olli>> Is it possible to use SpeechRecognizer without the Capture API?

Here's what we wrote in 7.1: "By default, microphone input will be provided by the user agent's default device. However, the application may use the SetInputDevice() functions to provide a particular Capture object, with particular configuration settings, in order to exercise more control over the recording operation."  So yes, you can do it without the Capture API, but you get less control.  We didn't want to duplicate the capture API in the speech reco API.
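Roughly, the shape is this; SetInputDevice() is from the proposal, but the stand-in recognizer object and the option names here are my own assumptions for illustration:

```javascript
// Hedged sketch of the 7.1 behavior quoted above. If no Capture object is
// supplied, the UA's default microphone is used; supplying one gives the
// app more control over the recording operation.
function configureRecognizer(reco, capture) {
  if (capture) {
    reco.SetInputDevice(capture); // explicit device/settings (more control)
  }
  return reco; // otherwise: UA default microphone
}

// Minimal stand-in so the helper can be exercised outside a browser;
// a real SpeechRecognizer would come from the proposed API.
var reco = {
  input: "default",
  SetInputDevice: function (dev) { this.input = dev; },
};
configureRecognizer(reco, null);            // default mic, less control
configureRecognizer(reco, { rate: 16000 }); // hypothetical capture settings
```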

<<Olli>> Using XHR could be removed, and Capture API and remote speech services could be supported in a v2 (assuming Capture API is even close to stable).

We're *very* skeptical of the notion of deferring these problems.  Access to remote services ranked very highly in the requirements voting, and it would be remiss for the XG to make a recommendation that didn't support this.  If there's no open way speech technology vendors can provide services to apps running in any browser, we've failed.  Most of the participants in this XG will be unable to provide services to developers and users without tackling this problem.

Likewise, we know from experience building lots of speech apps that microphone control is critical.  I don't think we get to not make a recommendation on this.

Section 8

<<Olli>> IIRC HTML5 parser does not ever create a child element for <input> element. So, the 'for' attribute approach would work better.

You're right about the parsing of <input> elements.  We noted the 'for' approach as an option if the child approach seems too radical.

<<Olli>> HTMLTTSElement could perhaps extend HTMLAudioElement, or at least HTMLMediaElement.

This was our initial thought, and we were enamored with Bjorn's proposal along these lines last November.  Our main concern was that the media elements also have semantics that don't mean anything to TTS.  What do <track> and TextTrack mean?  Why would TTS apps use <source> elements?  What does it mean to set controls = true?  In the end, we felt it was better to have something that was consistent with the media elements, where it made sense, but didn't have the baggage of semantics that didn't make sense.
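Purely for illustration (none of these element or attribute names are final, and this markup is not from the proposal), the contrast we have in mind:

```html
<!-- Invented sketch: a TTS element that behaves like a media element
     where that makes sense (play(), events), but carries none of the
     <track>/<source>/controls semantics that mean nothing for TTS. -->
<tts id="greeting" lang="en-US">Hello, world.</tts>
<script>
  // Played the way you'd play an <audio> element, minus the extra baggage.
  document.getElementById("greeting").play();
</script>
```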

Section 9

<<Milan>>You gave "implementation-specific events" one star.  But I couldn't figure out how this would work at all.

The star is for TTS.  You could do some creative things in the HTTP response to include some TTS events (e.g. we proposed something workable for mark events).  Not great, but worth a star.  No stars for SR though :(
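As one illustration of the kind of creative thing I mean, a service could list mark names and audio offsets in its HTTP response; the header name and format below are invented here, not what we proposed:

```javascript
// Hedged sketch: parse a hypothetical "X-TTS-Marks" response header of the
// assumed form "name1=offsetMs,name2=offsetMs" into mark events the app
// can schedule against audio playback.
function parseMarkHeader(value) {
  return value.split(",").map(function (entry) {
    var parts = entry.split("=");
    return { name: parts[0].trim(), offsetMs: parseInt(parts[1], 10) };
  });
}
```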



From: Young, Milan [mailto:Milan.Young@nuance.com]
Sent: Tuesday, March 08, 2011 9:05 PM
To: Robert Brown; public-xg-htmlspeech@w3.org
Subject: RE: speech API proposal (from Microsoft)


Hello, a few comments:

6.1

- I saw that you used standard VoiceXML parameter names for the timeouts in section 6.4 (e.g. incompletetimeout), but you invented non-speech-aware versions for the capture API.  Wouldn't it be more forward-looking to use the speech-aware variants for the capture API as well?  One day the endpointer may wish to integrate with the recognizer.

- I didn't quite understand where the endpointing would take place.  Is this implemented by the UA?

6.2

- I'm interested to know your reasoning for choosing XHR2 over WebSockets.

- I've been listening in on the RTW mailing list.  They are discussing sending RTP from the browser, so this might be another longer-term option.  The RTW guys are proposing to solve the DoS security problem with an authentication protocol at the start of the session.  It would involve the browser, the server, and all routers in the path.

6.4

- What do you think about using the MRCP RECOGNIZE method as the base format instead of inventing something so similar?

- JSON is usually preferred on the Web over XML.  What do you think about allowing multiple EMMA presentation types?

9

- You gave "implementation-specific events" one star.  But I couldn't figure out how this would work at all.

Thank you


________________________________
From: public-xg-htmlspeech-request@w3.org [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Robert Brown
Sent: Tuesday, March 01, 2011 9:11 AM
To: public-xg-htmlspeech@w3.org
Subject: speech API proposal (from Microsoft)

Hi Everyone,

Our proposal is posted here: http://lists.w3.org/Archives/Public/www-archive/2011Mar/att-0001/microsoft-api-draft-final.html

It proposes some extensions to existing APIs, as well as some speech-specific objects, and some speech-specific HTML.

Cheers,

/Rob
Received on Friday, 11 March 2011 20:57:46 UTC
