Re: Agreed recognition API? from Olli Pettay on 2011-05-19 (public-xg-htmlspeech@w3.org from May 2011)

From: Olli Pettay <Olli.Pettay@helsinki.fi>
Date: Thu, 19 May 2011 18:39:56 +0300
To: Bjorn Bringert <bringert@google.com>
CC: public-xg-htmlspeech@w3.org
Message-ID: <4DD539CC.4080002@helsinki.fi>
On 05/19/2011 05:59 PM, Bjorn Bringert wrote:
> By now the draft final report
> (http://www.w3.org/2005/Incubator/htmlspeech/live/NOTE-htmlspeech.html)
> contains a number of design agreements for the JavaScript API for
> speech recognition. I thought it would be a useful exercise to
> translate those agreements into a concrete API.
>
> The below IDL describes my interpretation of the parts of the API that
> we have agreed on so far. Many of the interface/function/attribute
> names are not yet agreed, so I mixed and matched from the Microsoft,
> Mozilla and Google proposals.
>
> interface SpeechInputRequest {
>     // URL (http: or data:) for an SRGS XML document, with or without SISR tags,
>     // or a URI for one of the predefined grammars
>     attribute DOMString grammar;

I think we need to support either multiple simultaneous grammars or
multiple simultaneous SpeechInputRequests (SIRs). MS has GrammarCollection,
so it supports multiple grammars; the SpeechRequest API supports multiple
active recognition objects.
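
To make the two options concrete, here is a rough TypeScript sketch of the shapes involved. The `grammars` attribute, both type names, the helper and the example URLs are hypothetical; none of them is taken from the Microsoft, Mozilla or Google proposals, and the sketch makes no claim that the two options behave identically in a real recognizer.

```typescript
// Option 1 (hypothetical): one request object that carries several grammars.
interface MultiGrammarRequest {
  grammars: string[];
}

// Option 2 (hypothetical): several request objects, one grammar each,
// all active at the same time.
interface SingleGrammarRequest {
  grammar: string;
}

// Expresses an option-1 request as a set of option-2 requests.
function asParallelRequests(multi: MultiGrammarRequest): SingleGrammarRequest[] {
  return multi.grammars.map((grammar) => ({ grammar }));
}

// Placeholder URLs, for illustration only.
const exampleRequests = asParallelRequests({
  grammars: ["http://example.org/cities.grxml", "http://example.org/dates.grxml"],
});
```

Either shape would cover the use case; the open question is which one the API should expose.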



>     // Recognition language. Language declared in grammar overrides this.
>     attribute DOMString lang;

I still wonder how to handle language in a way that doesn't leak private
data. There are very good use cases for lang, but the privacy problem
should be solved.


>     // URL for speech recognition engine, http: must be supported
>     attribute DOMString engine;
>
>     // Not yet discussed I think, but Google and Microsoft proposals have it
>     attribute long maxresults;
Very reasonable.

>
>     // Some timeout parameters will likely be agreed, not yet discussed
ditto

>
>      // Starts capturing audio and recognizing speech
>      void startSpeechInput();
>      // Stops capturing audio and lets speech recognition complete
>      void stopSpeechInput();
>      // Stops capturing audio and aborts speech recognition
>      void cancelSpeechInput();
>
>      attribute Function onaudiostart;
>      attribute Function onsoundstart;
>      attribute Function onspeechstart;
>      attribute Function onspeechend;
>      attribute Function onsoundend;
>      attribute Function onaudioend;
>      attribute Function onresult;
>      attribute Function onerror;
> };
> SpeechInputRequest implements EventTarget;
>
> Events:
>
> audiostart, interface: Event: Audio capture has started
> soundstart, interface: Event: Some sound, possibly speech, has been
> detected (low latency)
> speechstart, interface: Event: Speech start has been detected
> speechend, interface: Event: Speech end has been detected (hmm, can we
> really guarantee that this comes before soundend if the latter is a
> client endpointer)
> soundend, interface: Event: Sound end has been detected
> audioend, interface: Event: Audio capture has finished
> result, interface: SpeechResultEvent: Speech recognizer has returned a
> final result with at least one recognition hypothesis
> error, interface: Event (?): A speech recognition error has occurred
>
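
As a rough illustration of the event sequence listed above, here is a TypeScript sketch. `MockSpeechInputRequest` and `recordEventOrder` are hypothetical stand-ins: nothing is captured or recognized, and the mock simply fires the events of one successful recognition in the listed order. As noted in the draft, a real implementation may not be able to guarantee that speechend precedes soundend.

```typescript
// Hypothetical mock, for illustration only: no audio is captured and
// nothing is recognized.
type SpeechEvent = { type: string };
type Handler = ((event: SpeechEvent) => void) | null;

class MockSpeechInputRequest {
  onaudiostart: Handler = null;
  onsoundstart: Handler = null;
  onspeechstart: Handler = null;
  onspeechend: Handler = null;
  onsoundend: Handler = null;
  onaudioend: Handler = null;
  onresult: Handler = null;
  onerror: Handler = null; // never fired by this mock

  // Simulates one successful recognition by calling the handlers
  // synchronously, in the order the draft lists the events.
  startSpeechInput(): void {
    const sequence: Array<[string, Handler]> = [
      ["audiostart", this.onaudiostart],
      ["soundstart", this.onsoundstart],
      ["speechstart", this.onspeechstart],
      ["speechend", this.onspeechend],
      ["soundend", this.onsoundend],
      ["audioend", this.onaudioend],
      ["result", this.onresult],
    ];
    for (const [type, handler] of sequence) {
      if (handler !== null) handler({ type });
    }
  }
}

// Attaches the same recording handler to every event and returns the
// order in which the mock fired them.
function recordEventOrder(): string[] {
  const seen: string[] = [];
  const record: Handler = (event) => {
    seen.push(event.type);
  };
  const request = new MockSpeechInputRequest();
  request.onaudiostart = record;
  request.onsoundstart = record;
  request.onspeechstart = record;
  request.onspeechend = record;
  request.onsoundend = record;
  request.onaudioend = record;
  request.onresult = record;
  request.onerror = record;
  request.startSpeechInput();
  return seen;
}
```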
> // The event passed to the 'result' event handlers
> interface SpeechResultEvent : Event {
>      readonly attribute SpeechInputResult result;
> };
>
> // Recognition result as EMMA + simple N-best list
> interface SpeechInputResult {
>      readonly attribute Document resultEMMAXML;
>      readonly attribute DOMString resultEMMAText;
>      readonly attribute unsigned long length;
>      getter SpeechInputAlternative item(in unsigned long index);
> };
>
> // Item in N-best list
> interface SpeechInputAlternative {
>      readonly attribute DOMString utterance;
>      readonly attribute float confidence;
>      readonly attribute any interpretation;
> };
>
>
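
A page consuming the N-best list might look roughly like the TypeScript sketch below. The two interfaces mirror the IDL above, minus the EMMA attributes, which are left out to keep the sketch self-contained. `makeResult`, `bestAlternative` and the confidence threshold are hypothetical helpers for illustration, not part of any proposal, and the out-of-range behaviour of `item` is an assumption.

```typescript
// Mirrors the draft IDL; the EMMA attributes are omitted here.
interface SpeechInputAlternative {
  readonly utterance: string;
  readonly confidence: number;
  readonly interpretation: unknown;
}

interface SpeechInputResult {
  readonly length: number;
  item(index: number): SpeechInputAlternative;
}

// Hypothetical helper: wraps a plain array so it looks like a result.
function makeResult(alternatives: SpeechInputAlternative[]): SpeechInputResult {
  return {
    length: alternatives.length,
    item(index: number): SpeechInputAlternative {
      const alternative = alternatives[index];
      // Assumption: the draft does not say what an out-of-range index does.
      if (alternative === undefined) {
        throw new RangeError("index out of range: " + index);
      }
      return alternative;
    },
  };
}

// Hypothetical helper: returns the highest-confidence alternative at or
// above minConfidence, or null if no alternative qualifies.
function bestAlternative(
  result: SpeechInputResult,
  minConfidence: number
): SpeechInputAlternative | null {
  let best: SpeechInputAlternative | null = null;
  for (let i = 0; i < result.length; i++) {
    const candidate = result.item(i);
    if (
      candidate.confidence >= minConfidence &&
      (best === null || candidate.confidence > best.confidence)
    ) {
      best = candidate;
    }
  }
  return best;
}
```

The scan does not assume the list is sorted by confidence, since the draft does not state an ordering.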
> The HTML interface has not been agreed yet. The Mozilla proposal has
> none. The Microsoft proposal has a <reco> element as a child of
> <input> and other elements, or associated with elements using @for.
> Google has @speech attribute for <input> elements. If we agree on
> speech recognition element(s), the SpeechInputRequest interface should
> either be able to serve as the DOM interface for such elements, or the
> elements could have an attribute which contains a SpeechInputRequest.

My proposal has boundElement, which has some similarity to MS's @for,
but sure, it doesn't have any element for ASR/reco, only a JS object.



-Olli
Received on Thursday, 19 May 2011 15:40:24 UTC